Re: Read Local File

2017-06-14 Thread Dirceu Semighini Filho
Hello Satish,
Thanks for your answer
*I guess you have already made sure that the paths for your file are
exactly the same on each of your nodes*
Yes
* I'd also check the perms on your path*
The files are all with the same permissions
*Believe the sample code you pasted is only for testing - and you are
already aware that a distributed count on a local file has no benefits.*
I have done this to force the file read:
spark.sqlContext.read.text("file:///pathToFile").count

* Typing the path explicitly resolved*
I'll try this; I'm not sure if I've tested the varargs path option.
*Alternately - if the file size is small, you could do spark-submit with a
--files option which will ship the file to every executor and is available
for all executors. *
Unfortunately I can't do this, because it's a continuous reading process

Thanks.
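
(For reference, a quick sketch -- not from the original thread, and the path is a placeholder -- of how one could check from inside the executors whether every node really sees the same local path:)

import java.io.File

val path = "/pathToFile"  // placeholder: the local path being read
spark.sparkContext
  .parallelize(1 to 100, 100)
  .map(_ => (java.net.InetAddress.getLocalHost.getHostName, new File(path).exists))
  .distinct()
  .collect()
  .foreach(println)  // one (hostname, exists) pair per node that ran a task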



2017-06-14 4:35 GMT-03:00 satish lalam <satish.la...@gmail.com>:

> I guess you have already made sure that the paths for your file are
> exactly the same on each of your nodes. I'd also check the perms on your
> path.
> Believe the sample code you pasted is only for testing - and you are
> already aware that a distributed count on a local file has no benefits.
> Once I ran into a similar issue while copy pasting file paths probably due
> to encoding issues on some text editors. I'd copied a hidden char at the
> end of the path from source file which made my file lookup fail, but the
> code looked perfectly alright. Typing the path explicitly resolved it. But
> this is a corner case.
>
> Alternately - if the file size is small, you could do spark-submit with a
> --files option which will ship the file to every executor and is available
> for all executors.
>
>
>
>
> On Tue, Jun 13, 2017 at 11:02 AM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hi all,
>> I'm trying to read a file from the local filesystem. I have 4 workstations, 1
>> master and 3 slaves, running with Ambari and YARN with Spark version
>> *2.1.1.2.6.1.0-129*
>>
>> The code that I'm trying to run is quite simple
>>
>> spark.sqlContext.read.text("file:///pathToFile").count
>>
>> I've copied the file to all 4 workstations, and every time I try to
>> run this I get the following exception:
>> 17/06/13 17:57:37 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID
>> 12, ip, executor 1): java.io.FileNotFoundException: File file:pathToFile
>> does not exist
>> It is possible the underlying files have been updated. You can explicitly
>> invalidate the cache in Spark by running 'REFRESH TABLE tableName' command
>> in SQL or by recreating the Dataset/DataFrame involved.
>> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:175)
>> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>> at org.apache.spark.scheduler.Task.run(Task.scala:99)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> 17/06/13 17:57:37 ERROR TaskSetManager: Task 0 in stage 6.0 failed 4
>> times; aborting job
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>> 6.0 (TID 15, ip, executor 1): java.io.FileNotFoundException: File
>> file:file:pathToFile does not exist
>> It is possible the underlying files have been updated. You can explicitly
>> invalidate the cache in Spark by running 'REFRESH TABLE tableName' command
>> in SQL or by recreating the Dataset/DataFrame invo

Read Local File

2017-06-13 Thread Dirceu Semighini Filho
Hi all,
I'm trying to read a file from the local filesystem. I have 4 workstations, 1
master and 3 slaves, running with Ambari and YARN with Spark version
*2.1.1.2.6.1.0-129*

The code that I'm trying to run is quite simple

spark.sqlContext.read.text("file:///pathToFile").count

I've copied the file to all 4 workstations, and every time I try to run
this I get the following exception:
17/06/13 17:57:37 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 12,
ip, executor 1): java.io.FileNotFoundException: File file:pathToFile does
not exist
It is possible the underlying files have been updated. You can explicitly
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command
in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:175)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

17/06/13 17:57:37 ERROR TaskSetManager: Task 0 in stage 6.0 failed 4 times;
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage
6.0 (TID 15, ip, executor 1): java.io.FileNotFoundException: File
file:file:pathToFile does not exist
It is possible the underlying files have been updated. You can explicitly
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command
in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:175)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-12 Thread Dirceu Semighini Filho
allCategory: String= null, networkCallType: String=
null,
   chargeIndicator: String= null, typeOfNetwork:
String= null,
   releasingParty: String= null,
   userId: String= null, otherPartyName: String= null,
   otherPartyNamePresentationIndicator: String= null,
   clidPermitted: String= null,
   receivedCallingNumber: String= null, namePermitted:
   String = null) extends TabbedToString




// enriched classes

case class CompletedCDRs(completed: List[RichCDR], incompleted: List[CDR])

case class DonoUnidadeTelefonica(id: Long, nome: String)

case class CDR(headerModule: HeaderModule, basicModule: BasicModule = null,
   centerxModule: CenterxModule = null, ipModule: IpModule = null,
   tgppModule: TgppModule = null,
   partialCallBeginModule: PartialCallBeginModule = null,
   partialCallEndModule: PartialCallEndModule = null) extends TabbedToString

2017-05-09 4:54 GMT-03:00 Matthew cao <cybea...@gmail.com>:

> Hi,
> I have tried simple test like this:
> case class A(id: Long)
> val sample = spark.range(0,10).as[A]
> sample.createOrReplaceTempView("sample")
> val df = spark.emptyDataset[A]
> val df1 = spark.sql("select * from sample").as[A]
> df.union(df1)
>
> It runs OK. And for nullability, I thought that issue had been fixed:
> https://issues.apache.org/jira/browse/SPARK-18058
> I think you can check your Spark version and the schema of the dataset again?
> Hope this helps.
>
> Best,
>
> On 2017年5月9日, at 04:56, Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
> wrote:
>
> Ok, great,
> Well, I haven't provided a good example of what I'm doing. Let's assume that
> my case class is
> case class A(tons of fields, with sub classes)
>
> val df = sqlContext.sql("select * from a").as[A]
>
> val df2 = spark.emptyDataset[A]
>
> df.union(df2)
>
> This code will throw the exception.
> Is this expected? I assume that when I do as[A] it will convert the schema
> to the case class schema and it shouldn't throw the exception, or will this
> be done lazily when the union is processed?
>
>
>
> 2017-05-08 17:50 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
>
>> Yes, unfortunately. This should actually be fixed, and the union's schema
>> should use the less restrictive of the two DataFrames' schemas.
>>
>> On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <
>> dirceu.semigh...@gmail.com> wrote:
>>
>>> HI Burak,
>>> By nullability you mean that if I have the exactly the same schema, but
>>> one side support null and the other doesn't, this exception (in union
>>> dataset) will be thrown?
>>>
>>>
>>>
>>> 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
>>>
>>>> I also want to add that generally these may be caused by the
>>>> `nullability` field in the schema.
>>>>
>>>> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu <
>>>> shixi...@databricks.com> wrote:
>>>>
>>>>> This is because RDD.union doesn't check the schema, so you won't see
>>>>> the problem unless you run RDD and hit the incompatible column problem. 
>>>>> For
>>>>> RDD, You may not see any error if you don't use the incompatible column.
>>>>>
>>>>> Dataset.union requires compatible schema. You can print ds.schema and
>>>>> ds1.schema and check if they are same.
>>>>>
>>>>> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho <
>>>>> dirceu.semigh...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>> I've a very complex case class structure, with a lot of fields.
>>>>>> When I try to union two datasets of this class, it doesn't work with
>>>>>> the following error :
>>>>>> ds.union(ds1)
>>>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>>>> Union can only be performed on tables with the compatible column types
>>>>>>
>>>>>> But when use it's rdd, the union goes right:
>>>>>> ds.rdd.union(ds1.rdd)
>>>>>> res8: org.apache.spark.rdd.RDD[
>>>>>>
>>>>>> Is there any reason for this to happen (besides a bug ;) )
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>


Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Ok, great,
Well, I haven't provided a good example of what I'm doing. Let's assume that
my case class is
case class A(tons of fields, with sub classes)

val df = sqlContext.sql("select * from a").as[A]

val df2 = spark.emptyDataset[A]

df.union(df2)

This code will throw the exception.
Is this expected? I assume that when I do as[A] it will convert the schema
to the case class schema and it shouldn't throw the exception, or will this
be done lazily when the union is processed?



2017-05-08 17:50 GMT-03:00 Burak Yavuz <brk...@gmail.com>:

> Yes, unfortunately. This should actually be fixed, and the union's schema
> should use the less restrictive of the two DataFrames' schemas.
>
> On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hi Burak,
>> By nullability, do you mean that if I have exactly the same schema, but
>> one side supports null and the other doesn't, this exception (in Dataset
>> union) will be thrown?
>>
>>
>>
>> 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
>>
>>> I also want to add that generally these may be caused by the
>>> `nullability` field in the schema.
>>>
>>> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
>>>> This is because RDD.union doesn't check the schema, so you won't see
>>>> the problem unless you run RDD and hit the incompatible column problem. For
>>>> RDD, You may not see any error if you don't use the incompatible column.
>>>>
>>>> Dataset.union requires compatible schema. You can print ds.schema and
>>>> ds1.schema and check if they are same.
>>>>
>>>> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho <
>>>> dirceu.semigh...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>> I've a very complex case class structure, with a lot of fields.
>>>>> When I try to union two datasets of this class, it doesn't work with
>>>>> the following error :
>>>>> ds.union(ds1)
>>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>>> Union can only be performed on tables with the compatible column types
>>>>>
>>>>> But when use it's rdd, the union goes right:
>>>>> ds.rdd.union(ds1.rdd)
>>>>> res8: org.apache.spark.rdd.RDD[
>>>>>
>>>>> Is there any reason for this to happen (besides a bug ;) )
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Hi Burak,
By nullability, do you mean that if I have exactly the same schema, but one
side supports null and the other doesn't, this exception (in Dataset union)
will be thrown?



2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:

> I also want to add that generally these may be caused by the `nullability`
> field in the schema.
>
> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> This is because RDD.union doesn't check the schema, so you won't see the
>> problem unless you run RDD and hit the incompatible column problem. For
>> RDD, you may not see any error if you don't use the incompatible column.
>>
>> Dataset.union requires compatible schema. You can print ds.schema and
>> ds1.schema and check if they are same.
>>
>> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho <
>> dirceu.semigh...@gmail.com> wrote:
>>
>>> Hello,
>>> I have a very complex case class structure, with a lot of fields.
>>> When I try to union two datasets of this class, it fails with the
>>> following error:
>>> ds.union(ds1)
>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>> Union can only be performed on tables with the compatible column types
>>>
>>> But when I use its RDD, the union works fine:
>>> ds.rdd.union(ds1.rdd)
>>> res8: org.apache.spark.rdd.RDD[
>>>
>>> Is there any reason for this to happen (besides a bug ;) )
>>>
>>>
>>>
>>
>


Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Hello,
I have a very complex case class structure, with a lot of fields.
When I try to union two datasets of this class, it fails with the
following error:
ds.union(ds1)
Exception in thread "main" org.apache.spark.sql.AnalysisException: Union
can only be performed on tables with the compatible column types

But when I use its RDD, the union works fine:
ds.rdd.union(ds1.rdd)
res8: org.apache.spark.rdd.RDD[

Is there any reason for this to happen (besides a bug ;) )
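
(A rough sketch, not from the thread, of the schema check Shixiong suggests, plus one way to line the two sides up before the union; A stands for the case class from the message, and the select-based realignment only helps when the mismatch is column order, not nullability:)

ds.schema.printTreeString()
ds1.schema.printTreeString()

// Compare field by field (assumes both schemas have the same number of fields)
val diffs = ds.schema.fields.zip(ds1.schema.fields).filter {
  case (a, b) => a.name != b.name || a.dataType != b.dataType
}
diffs.foreach(println)

// Dataset.union resolves columns by position, so reordering one side explicitly
// can fix unions that fail only because of column order:
import org.apache.spark.sql.functions.col
val aligned = ds1.select(ds.columns.map(col): _*).as[A]
ds.union(aligned)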


Cant convert Dataset to case class with Option fields

2017-04-07 Thread Dirceu Semighini Filho
Hi Devs,
I have some case classes here, and their fields are all optional:
case class A(b:Option[B] = None, c: Option[C] = None, ...)

If I read some data into a Dataset and try to convert it to this case class
using the as method, it doesn't give me any answer; it simply freezes.
If I change the case class to

case class A(b:B,c:C)
it works nicely and returns the field values as null.

Are Option fields not supported by the as method, or is this an issue?

Kind Regards,
Dirceu
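
(A minimal, self-contained reproduction sketch for this report; the classes and the JSON literal below are made up, not taken from the original code:)

case class B(x: Int)
case class C(y: String)
case class A(b: Option[B] = None, c: Option[C] = None)

import spark.implicits._
val json = spark.sparkContext.parallelize(Seq("""{"b":{"x":1},"c":{"y":"a"}}"""))
val ds = spark.read.json(json).as[A]  // the .as[A] call with Option fields is what reportedly hangs
ds.show()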


Re: specifing schema on dataframe

2017-02-04 Thread Dirceu Semighini Filho
Hi Sam,
Remove the quotes from the number and it will work.

On Feb 4, 2017, 11:46 AM, "Sam Elamin"
wrote:

> Hi All
>
> I would like to specify a schema when reading from a json but when trying
> to map a number to a Double it fails, I tried FloatType and IntType with no
> joy!
>
>
> When inferring the schema customer id is set to String, and I would like
> to cast it as Double
>
> so df1 is corrupted while df2 shows
>
>
> Also FYI I need this to be generic as I would like to apply it to any
> json, I specified the below schema as an example of the issue I am facing
>
> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, 
> DoubleType,FloatType, StructType, LongType,DecimalType}
> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
> val df1 = 
> spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> val df2 = 
> spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> df1.show(1)
> df2.show(1)
>
>
> Any help would be appreciated, I am sure I am missing something obvious
> but for the life of me I can't tell what it is!
>
>
> Kind Regards
> Sam
>
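
(If the source JSON can't be changed, an alternative sketch -- not from the thread -- is to let the schema be inferred with the field as a string and cast it afterwards:)

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val df = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
val fixed = df.withColumn("customerid", col("customerid").cast(DoubleType))
fixed.printSchema()
fixed.show(1)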


Re: Time-Series Analysis with Spark

2017-01-11 Thread Dirceu Semighini Filho
Hello Rishabh,
We have done some time-series forecasting using ARIMA in our project;
it's built on top of Spark and it's open source:
https://github.com/eleflow/uberdata

Kind Regards,
Dirceu

2017-01-11 8:20 GMT-02:00 Sean Owen :

> https://github.com/sryza/spark-timeseries ?
>
> On Wed, Jan 11, 2017 at 10:11 AM Rishabh Bhardwaj 
> wrote:
>
>> Hi All,
>>
>> I am exploring time-series forecasting with Spark.
>> I have some questions regarding this:
>>
>> 1. Is there any library/package out there in community of *Seasonal
>> ARIMA* implementation in Spark?
>>
>> 2. Is there any implementation of Dynamic Linear Model (*DLM*) on Spark?
>>
>> 3. What are the libraries which can be leveraged for time-series analysis
>> with Spark (like spark-ts)?
>>
>> 4. Is there any roadmap of having time-series algorithms like AR,MA,ARIMA
>> in Spark codebase itself?
>>
>> Thank You,
>> Rishabh.
>>
>
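
(A toy sketch, not from the thread, of the plain-MLlib option: linear regression on a time index. The data and column names are made up, and a SparkSession named spark is assumed:)

import spark.implicits._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

val history = Seq((1.0, 10.0), (2.0, 12.0), (3.0, 15.0), (4.0, 16.0)).toDF("t", "value")
val assembler = new VectorAssembler().setInputCols(Array("t")).setOutputCol("features")
val model = new LinearRegression()
  .setLabelCol("value").setFeaturesCol("features")
  .fit(assembler.transform(history))

// Forecast the next two time steps
val future = assembler.transform(Seq(5.0, 6.0).toDF("t"))
model.transform(future).select("t", "prediction").show()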


Re: How many Spark streaming applications can be run at a time on a Spark cluster?

2016-12-24 Thread Dirceu Semighini Filho
Hi,
You can run multiple Spark apps per cluster. You will have one streaming
context per app.
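
(A bare-bones sketch, not from the thread: each submitted application builds its own single StreamingContext; the app name and batch interval are placeholders:)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-app-1")
val ssc = new StreamingContext(conf, Seconds(10))  // the one streaming context for this app
// ... create DStreams (e.g. from Kafka) and their transformations here ...
ssc.start()
ssc.awaitTermination()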


On Dec 24, 2016, 18:22, "shyla deshpande"
wrote:

> Hi All,
>
> Thank you for the response.
>
> As per
>
> https://docs.cloud.databricks.com/docs/latest/databricks_
> guide/index.html#07%20Spark%20Streaming/15%20Streaming%20FAQs.html
>
> There can be only one streaming context in a cluster which implies only
> one streaming job.
>
> So, I am still confused. Anyone having more than 1 spark streaming app in
> a cluster running at the same time, please share your experience.
>
> Thanks
>
> On Wed, Dec 14, 2016 at 6:54 PM, Akhilesh Pathodia <
> pathodia.akhil...@gmail.com> wrote:
>
>> If you have enough cores/resources, run them separately depending on your
>> use case.
>>
>>
>> On Thursday 15 December 2016, Divya Gehlot 
>> wrote:
>>
>>> It depends on the use case ...
>>> Spark always depends on resource availability.
>>> As long as you have the resources to accommodate them, you can run as many
>>> Spark/Spark Streaming applications as you like.
>>>
>>>
>>> Thanks,
>>> Divya
>>>
>>> On 15 December 2016 at 08:42, shyla deshpande 
>>> wrote:
>>>
 How many Spark streaming applications can be run at a time on a Spark
 cluster?

 Is it better to have 1 spark streaming application to consume all the
 Kafka topics or have multiple streaming applications when possible to keep
 it simple?

 Thanks


>>>
>


Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Dirceu Semighini Filho
Hello,
We have done some tests here, and it seems that when we use a prime number
of partitions the data is more evenly spread.
This has to do with hash partitioning and the Java hash algorithm.
I don't know what your data looks like or how this behaves in Python, but if
you (can) implement a partitioner, or change it from the default, you will
get a better result.

Dirceu
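
(A quick sketch of the idea, not from the thread -- the output path is a placeholder and the original job is PySpark, so this Scala version is only illustrative: repartition to a prime number of partitions before writing, instead of coalesce(72):)

val df = spark.read.csv("s3://example-rawdata-prod/data/2016-12-13/")
df.repartition(71)  // 71 is prime and close to the 72 available cores
  .write.parquet("s3://example-output/")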

2016-12-14 12:41 GMT-02:00 Adrian Bridgett <adr...@opensignal.com>:

> Since it's pyspark it's just using the default hash partitioning I
> believe.  Trying a prime number (71 so that there's enough CPUs) doesn't
> seem to change anything.  Out of curiosity why did you suggest that?
> Googling "spark coalesce prime" doesn't give me any clue :-)
> Adrian
>
>
> On 14/12/2016 13:58, Dirceu Semighini Filho wrote:
>
> Hi Adrian,
> Which kind of partitioning are you using?
> Have you already tried to coalesce it to a prime number?
>
>
> 2016-12-14 11:56 GMT-02:00 Adrian Bridgett <adr...@opensignal.com>:
>
>> I realise that coalesce() isn't guaranteed to be balanced and adding a
>> repartition() does indeed fix this (at the cost of a large shuffle).
>>
>> I'm trying to understand _why_ it's so uneven (hopefully it helps someone
>> else too).   This is using spark v2.0.2 (pyspark).
>>
>> Essentially we're just reading CSVs into a DataFrame (which we persist
>> serialised for some calculations), then writing it back out as PRQ.  To
>> avoid too many PRQ files I've set a coalesce of 72 (9 boxes, 8 CPUs each).
>>
>> The writers end up with about 700-900MB each (not bad).  Except for one
>> which is at 6GB before I killed it.
>>
>> Input data is 12000 gzipped CSV files in S3 (approx 30GB), named like
>> this, almost all about 2MB each:
>> s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587209
>> -i-da71c942-389.gz
>> s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587529
>> -i-01d3dab021b760d29-334.gz
>>
>> (we're aware that this isn't an ideal naming convention from an S3
>> performance PoV).
>>
>> The actual CSV file format is:
>> UUID\tINT\tINT\... . (wide rows - about 300 columns)
>>
>> e.g.:
>> 17f9c2a7-ddf6-42d3-bada-63b845cb33a51481587198750   11213
>> 1d723493-5341-450d-a506-5c96ce0697f01481587198751   11212 ...
>> 64cec96f-732c-44b8-a02e-098d5b63ad771481587198752   11211 ...
>>
>> The dataframe seems to be stored evenly on all the nodes (according to
>> the storage tab) and all the blocks are the same size.   Most of the tasks
>> are executed at NODE_LOCAL locality (although there are a few ANY).  The
>> oversized task is NODE_LOCAL though.
>>
>> The reading and calculations all seem evenly spread, confused why the
>> writes aren't as I'd expect the input partitions to be even, what's causing
>> and what we can do?  Maybe it's possible for coalesce() to be a bit smarter
>> in terms of which partitions it coalesces - balancing the size of the final
>> partitions rather than the number of source partitions in each final
>> partition.
>>
>> Thanks for any light you can shine!
>>
>> Adrian
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> --
> *Adrian Bridgett* |  Sysadmin Engineer, OpenSignal
> <http://www.opensignal.com>
> _
> Office: 3rd Floor, The Angel Office, 2 Angel Square, London, EC1V 1NY
> Phone #: +44 777-377-8251
> Skype: abridgett  |  @adrianbridgett <http://twitter.com/adrianbridgett>  |
>  LinkedIn link  <https://uk.linkedin.com/in/abridgett>
> _
>


Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Dirceu Semighini Filho
Hi Adrian,
Which kind of partitioning are you using?
Have you already tried to coalesce it to a prime number?


2016-12-14 11:56 GMT-02:00 Adrian Bridgett :

> I realise that coalesce() isn't guaranteed to be balanced and adding a
> repartition() does indeed fix this (at the cost of a large shuffle).
>
> I'm trying to understand _why_ it's so uneven (hopefully it helps someone
> else too).   This is using spark v2.0.2 (pyspark).
>
> Essentially we're just reading CSVs into a DataFrame (which we persist
> serialised for some calculations), then writing it back out as PRQ.  To
> avoid too many PRQ files I've set a coalesce of 72 (9 boxes, 8 CPUs each).
>
> The writers end up with about 700-900MB each (not bad).  Except for one
> which is at 6GB before I killed it.
>
> Input data is 12000 gzipped CSV files in S3 (approx 30GB), named like
> this, almost all about 2MB each:
> s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587209
> -i-da71c942-389.gz
> s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587529
> -i-01d3dab021b760d29-334.gz
>
> (we're aware that this isn't an ideal naming convention from an S3
> performance PoV).
>
> The actual CSV file format is:
> UUID\tINT\tINT\... . (wide rows - about 300 columns)
>
> e.g.:
> 17f9c2a7-ddf6-42d3-bada-63b845cb33a51481587198750   11213
> 1d723493-5341-450d-a506-5c96ce0697f01481587198751   11212 ...
> 64cec96f-732c-44b8-a02e-098d5b63ad771481587198752   11211 ...
>
> The dataframe seems to be stored evenly on all the nodes (according to the
> storage tab) and all the blocks are the same size.   Most of the tasks are
> executed at NODE_LOCAL locality (although there are a few ANY).  The
> oversized task is NODE_LOCAL though.
>
> The reading and calculations all seem evenly spread, confused why the
> writes aren't as I'd expect the input partitions to be even, what's causing
> and what we can do?  Maybe it's possible for coalesce() to be a bit smarter
> in terms of which partitions it coalesces - balancing the size of the final
> partitions rather than the number of source partitions in each final
> partition.
>
> Thanks for any light you can shine!
>
> Adrian
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-17 Thread Dirceu Semighini Filho
Nice, thank you. I'll test this property to see if the errors stop.
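
(A sketch of one way to experiment with the flags mentioned below -- Arijit set "hdfs.append.support" in hdfs-site.xml, which is the route known to have worked; setting both candidate keys programmatically on the Hadoop configuration is only a cheap test and may not reach every code path:)

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wal-append-test"))
sc.hadoopConfiguration.set("hdfs.append.support", "true")  // the key the WAL code quoted below reads
sc.hadoopConfiguration.set("dfs.support.append", "true")   // the standard HDFS key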


2016-11-17 14:48 GMT-02:00 Arijit <arij...@live.com>:

> Hi Dirceu,
>
>
> For the append issue we are setting "hdfs.append.support" (from Spark code
> which reads HDFS configuration) to "true" in hdfs-site.xml and that seemed
> to have solved the issue. Of course we are using HDFS which does support
> append. I think the actual configuration Spark should check is
> "dfs.support.append".
>
>
> I believe failure is intermittent since in most cases a new file is
> created to store the block addition event. I need to look into the code
> again to see when these files are created new and when they are appended.
>
>
> Thanks, Arijit
>
>
> --
> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
> *Sent:* Thursday, November 17, 2016 6:50:28 AM
> *To:* Arijit
> *Cc:* Tathagata Das; user@spark.apache.org
>
> *Subject:* Re: Spark Streaming Data loss on failure to write
> BlockAdditionEvent failure to WAL
>
> Hi Arijit,
> Have you found a solution for this? I'm facing the same problem in Spark
> 1.6.1, but here the error happens only a few times, so our HDFS does
> support append.
> This is what I can see in the logs:
> 2016-11-17 13:43:20,012 ERROR [BatchedWriteAheadLog Writer]
> WriteAheadLogManager  for Thread: Failed to write to write ahead log after
> 3 failures
>
>
>
>
> 2016-11-08 14:47 GMT-02:00 Arijit <arij...@live.com>:
>
>> Thanks TD.
>>
>>
>> Is "hdfs.append.support" a standard configuration? I see a seemingly
>> equivalent configuration "dfs.support.append" that is used in our
>> version of HDFS.
>>
>>
>> In case we want to use a pseudo file-system (like S3)  which does not
>> support append what are our options? I am not familiar with the code yet
>> but is it possible to generate a new file whenever conflict of this sort
>> happens?
>>
>>
>> Thanks again, Arijit
>> --
>> *From:* Tathagata Das <tathagata.das1...@gmail.com>
>> *Sent:* Monday, November 7, 2016 7:59:06 PM
>> *To:* Arijit
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Spark Streaming Data loss on failure to write
>> BlockAdditionEvent failure to WAL
>>
>> For WAL in Spark to work with HDFS, the HDFS version you are running must
>> support file appends. Contact your HDFS package/installation provider to
>> figure out whether this is supported by your HDFS installation.
>>
>> On Mon, Nov 7, 2016 at 2:04 PM, Arijit <arij...@live.com> wrote:
>>
>>> Hello All,
>>>
>>>
>>> We are using Spark 1.6.2 with WAL enabled and encountering data loss
>>> when the following exception/warning happens. We are using HDFS as our
>>> checkpoint directory.
>>>
>>>
>>> Questions are:
>>>
>>>
>>> 1. Is this a bug in Spark or issue with our configuration? Source looks
>>> like the following. Which file already exist or who is suppose to set
>>> hdfs.append.support configuration? Why doesn't it happen all the time?
>>>
>>>
>>> private[streaming] object HdfsUtils {
>>>
>>>   def getOutputStream(path: String, conf: Configuration): FSDataOutputStream = {
>>>     val dfsPath = new Path(path)
>>>     val dfs = getFileSystemForPath(dfsPath, conf)
>>>     // If the file exists and we have append support, append instead of creating a new file
>>>     val stream: FSDataOutputStream = {
>>>       if (dfs.isFile(dfsPath)) {
>>>         if (conf.getBoolean("hdfs.append.support", false) || dfs.isInstanceOf[RawLocalFileSystem]) {
>>>           dfs.append(dfsPath)
>>>         } else {
>>>           throw new IllegalStateException("File exists and there is no append support!")
>>>         }
>>>       } else {
>>>         dfs.create(dfsPath)
>>>       }
>>>     }
>>>     stream
>>>   }
>>>
>>>
>>> 2. Why does the job not retry and eventually fail when this error
>>> occurs? The job skips processing the exact number of events dumped in the
>>> log. For this particular example I see 987 + 4686 events were not processed
>>> and are lost for ever (does not recover even on restart).
>>>
>>>
>>> 16/11/07 21:23:39 ERROR WriteAheadLogManager  for Thread: 

Re: Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-17 Thread Dirceu Semighini Filho
Hi Arijit,
Have you found a solution for this? I'm facing the same problem in Spark
1.6.1, but here the error happens only a few times, so our HDFS does
support append.
This is what I can see in the logs:
2016-11-17 13:43:20,012 ERROR [BatchedWriteAheadLog Writer]
WriteAheadLogManager  for Thread: Failed to write to write ahead log after
3 failures




2016-11-08 14:47 GMT-02:00 Arijit :

> Thanks TD.
>
>
> Is "hdfs.append.support" a standard configuration? I see a seemingly
> equivalent configuration "dfs.support.append" that is used in our version
> of HDFS.
>
>
> In case we want to use a pseudo file-system (like S3)  which does not
> support append what are our options? I am not familiar with the code yet
> but is it possible to generate a new file whenever conflict of this sort
> happens?
>
>
> Thanks again, Arijit
> --
> *From:* Tathagata Das 
> *Sent:* Monday, November 7, 2016 7:59:06 PM
> *To:* Arijit
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark Streaming Data loss on failure to write
> BlockAdditionEvent failure to WAL
>
> For WAL in Spark to work with HDFS, the HDFS version you are running must
> support file appends. Contact your HDFS package/installation provider to
> figure out whether this is supported by your HDFS installation.
>
> On Mon, Nov 7, 2016 at 2:04 PM, Arijit  wrote:
>
>> Hello All,
>>
>>
>> We are using Spark 1.6.2 with WAL enabled and encountering data loss when
>> the following exception/warning happens. We are using HDFS as our
>> checkpoint directory.
>>
>>
>> Questions are:
>>
>>
>> 1. Is this a bug in Spark or issue with our configuration? Source looks
>> like the following. Which file already exist or who is suppose to set
>> hdfs.append.support configuration? Why doesn't it happen all the time?
>>
>>
>> private[streaming] object HdfsUtils {
>>
>>   def getOutputStream(path: String, conf: Configuration): FSDataOutputStream = {
>>     val dfsPath = new Path(path)
>>     val dfs = getFileSystemForPath(dfsPath, conf)
>>     // If the file exists and we have append support, append instead of creating a new file
>>     val stream: FSDataOutputStream = {
>>       if (dfs.isFile(dfsPath)) {
>>         if (conf.getBoolean("hdfs.append.support", false) || dfs.isInstanceOf[RawLocalFileSystem]) {
>>           dfs.append(dfsPath)
>>         } else {
>>           throw new IllegalStateException("File exists and there is no append support!")
>>         }
>>       } else {
>>         dfs.create(dfsPath)
>>       }
>>     }
>>     stream
>>   }
>>
>>
>> 2. Why does the job not retry and eventually fail when this error occurs?
>> The job skips processing the exact number of events dumped in the log. For
>> this particular example I see 987 + 4686 events were not processed and are
>> lost for ever (does not recover even on restart).
>>
>>
>> 16/11/07 21:23:39 ERROR WriteAheadLogManager  for Thread: Failed to write to write ahead log after 3 failures
>> 16/11/07 21:23:39 WARN BatchedWriteAheadLog: BatchedWriteAheadLog Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=1212 lim=1212 cap=1212],1478553818985,scala.concurrent.impl.Promise$DefaultPromise@5ce88cb6), Record(java.nio.HeapByteBuffer[pos=1212 lim=1212 cap=1212],1478553818985,scala.concurrent.impl.Promise$DefaultPromise@6d8f1feb))
>> java.lang.IllegalStateException: File exists and there is no append support!
>> at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35)
>> at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.org$apache$spark$streaming$util$FileBasedWriteAheadLogWriter$$stream$lzycompute(FileBasedWriteAheadLogWriter.scala:33)
>> at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.org$apache$spark$streaming$util$FileBasedWriteAheadLogWriter$$stream(FileBasedWriteAheadLogWriter.scala:33)
>> at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.<init>(FileBasedWriteAheadLogWriter.scala:41)
>> at org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:217)
>> at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:86)
>> at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:48)
>> at org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:173)
>> at org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:140)
>> at java.lang.Thread.run(Thread.java:745)
>> 16/11/07 21:23:39 WARN ReceivedBlockTracker: Exception thrown while
>> writing record: BlockAdditionEvent(ReceivedBlockInfo(2,Some(987
>> 

Re: Writing parquet table using spark

2016-11-16 Thread Dirceu Semighini Filho
Hello,
Have you configured this property?
spark.sql.parquet.compression.codec
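
(A sketch, not from the thread, of setting the codec before the insert, assuming the hiveContext from the question; the table names are placeholders:)

hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")
hiveContext.sql("INSERT INTO TABLE target_parquet SELECT * FROM source_table")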



2016-11-16 6:40 GMT-02:00 Vaibhav Sinha :

> Hi,
> I am using hiveContext.sql() method to select data from source table and
> insert into parquet tables.
> The query executed from spark takes about 3x more disk space to write
> the same number of rows compared to when fired from impala.
> Just wondering if this is normal behaviour and if there's a way to
> control this.
>
> Best
> Vaibhav.
>
>
> --
> Sent from my iPhone.
>


Can somebody remove this guy?

2016-09-23 Thread Dirceu Semighini Filho
Can somebody remove this guy from the list
tod...@yahoo-inc.com
I just sent a message to the list and received an email from Yahoo saying that
this address doesn't exist anymore.

This is an automatically generated message.

tod...@yahoo-inc.com is no longer with Yahoo! Inc.

Your message will not be forwarded.

If you have a sales inquiry, please email yahoosa...@yahoo-inc.com and
someone will follow up with you shortly.

If you require assistance with a legal matter, please send a message to
legal-noti...@yahoo-inc.com

Thank you!

Regards,
Dirceu


Re: Re: Re: it does not stop at breakpoints which is in an anonymous function

2016-09-23 Thread Dirceu Semighini Filho
Hi Felix,
I just ran your code and it prints

Pi is roughly 4.0

Here is the code that I used; as you didn't show what random is, I used
nextInt():

 val n = math.min(10L * slices, Int.MaxValue).toInt // avoid overflow
val count = context.sparkContext.parallelize(1 until n, slices).map { i
=>
  val random = new scala.util.Random(1000).nextInt()
  val x = random * 2 - 1  //(breakpoint-1)
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / (n - 1))
context.sparkContext.stop()

Also, using the debugger, it stops inside the map (breakpoint 1) before going
to the print.

2016-09-18 6:47 GMT-03:00 chen yong <cy...@hotmail.com>:

> Dear Dirceu,
>
>
> Below is our testing code. As you can see, we have used the "reduce" action
> to invoke evaluation. However, it still did not stop at breakpoint-1 (as
> shown in the code snippet) when debugging.
>
>
>
> We are using IDEA version 14.0.3 to debug. It is very strange to us.
> Please help us (me and my colleagues).
>
>
> // scalastyle:off println
> package org.apache.spark.examples
> import scala.math.random
> import org.apache.spark._
> import scala.util.logging.Logged
>
> /** Computes an approximation to pi */
> object SparkPi{
>   def main(args: Array[String]) {
>
> val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
> val spark = new SparkContext(conf)
> val slices = if (args.length > 0) args(0).toInt else 2
> val n = math.min(10L * slices, Int.MaxValue).toInt // avoid
> overflow
> val count = spark.parallelize(1 until n, slices).map { i =>
> val x = random * 2 - 1  (breakpoint-1)
> val y = random * 2 - 1
> if (x*x + y*y < 1) 1 else 0
>   }.reduce(_ + _)
> println("Pi is roughly " + 4.0 * count / (n - 1))
> spark.stop()
>   }
> }
>
>
>
>
> --
> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
> *Sent:* September 16, 2016, 22:27
> *To:* chen yong
> *Cc:* user@spark.apache.org
> *Subject:* Re: Re: it does not stop at breakpoints which is in an anonymous
> function
>
> No, that's not the right way of doing it.
> Remember that RDD operations are lazy, due to performance reasons.
> Whenever you call one of those operation methods (count, reduce, collect,
> ...) they will execute all the functions that you have applied to create that
> RDD.
> It would help if you can post your code here, and also the way that you
> are executing it, and trying to debug.
>
>
> 2016-09-16 11:23 GMT-03:00 chen yong <cy...@hotmail.com>:
>
>> Also, I wonder what is the right way to debug a Spark program. If I use
>> ten anonymous functions in one Spark program, then for debugging each of them
>> I have to place a COUNT action in advance and then remove it after debugging.
>> Is that the right way?
>>
>>
>> --
>> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
>> *Sent:* September 16, 2016, 21:07
>> *To:* chen yong
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Re: Re: Re: Re: t it does not stop at breakpoints which is in
>> an anonymous function
>>
>> Hello Felix,
>> No, this line isn't the one that is triggering the execution of the
>> function, the count does that, unless your count val is a lazy val.
>> The count method is the one that retrieves the information of the RDD; it
>> has to go through all of its data to determine how many records the RDD
>> has.
>>
>> Regards,
>>
>> 2016-09-15 22:23 GMT-03:00 chen yong <cy...@hotmail.com>:
>>
>>>
>>> Dear Dirceu,
>>>
>>> Thanks for your kind help.
>>> i cannot see any code line corresponding to ". retrieve the data
>>> from your DataFrame/RDDs". which you suggested in the previous replies.
>>>
>>> Later, I guess
>>>
>>> the line
>>>
>>> val test = count
>>>
>>> is the key point. without it, it would not stop at the breakpont-1,
>>> right?
>>>
>>>
>>>
>>> --
>>> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
>>> *Sent:* September 16, 2016, 0:39
>>> *To:* chen yong
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Re: Re: Re: t it does not stop at breakpoints which is in an
>>> anonymous function
>>>
>>> Hi Felix,
>>> Are sure your n is greater than 0?
>>> Here it stops first at breakpoint 1, image attached.
>

Re: Re: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
Sorry, it wasn't the count, it was the reduce method that retrieves
information from the RDD.
It has to go through all the RDD values to return the result.


2016-09-16 11:18 GMT-03:00 chen yong <cy...@hotmail.com>:

> Dear Dirceu,
>
>
> I am totally confused. In your reply you mentioned "...the count does
> that, ...". However, in the code snippet shown in the attachment file
> FelixProblem.png
> of your previous mail, I cannot find any 'count' ACTION being called. Would
> you please clearly show me which line it is that triggers the evaluation?
>
> Thanks you very much
> ------
> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
> *Sent:* September 16, 2016, 21:07
> *To:* chen yong
> *Cc:* user@spark.apache.org
> *Subject:* Re: Re: Re: Re: Re: t it does not stop at breakpoints which is in
> an anonymous function
>
> Hello Felix,
> No, this line isn't the one that is triggering the execution of the
> function, the count does that, unless your count val is a lazy val.
> The count method is the one that retrieves the information of the RDD; it
> has to go through all of its data to determine how many records the RDD
> has.
>
> Regards,
>
> 2016-09-15 22:23 GMT-03:00 chen yong <cy...@hotmail.com>:
>
>>
>> Dear Dirceu,
>>
>> Thanks for your kind help.
>> i cannot see any code line corresponding to ". retrieve the data from
>> your DataFrame/RDDs". which you suggested in the previous replies.
>>
>> Later, I guess
>>
>> the line
>>
>> val test = count
>>
>> is the key point. without it, it would not stop at the breakpont-1, right?
>>
>>
>>
>> --
>> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
>> *Sent:* September 16, 2016, 0:39
>> *To:* chen yong
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Re: Re: Re: t it does not stop at breakpoints which is in an
>> anonymous function
>>
>> Hi Felix,
>> Are you sure your n is greater than 0?
>> Here it stops first at breakpoint 1, image attached.
>> Have you got the count to see if it's also greater than 0?
>>
>> 2016-09-15 11:41 GMT-03:00 chen yong <cy...@hotmail.com>:
>>
>>> Dear Dirceu
>>>
>>>
>>> Thank you for your help.
>>>
>>>
>>> Actually, I use IntelliJ IDEA to debug the Spark code.
>>>
>>>
>>> Let me use the following code snippet to illustrate my problem. In the
>>> code lines below, I've set two breakpoints, breakpoint-1 and breakpoint-2.
>>> when i debuged the code, it did not stop at breakpoint-1, it seems that
>>> the map
>>>
>>> function was skipped and it directly reached and stoped at the
>>> breakpoint-2.
>>>
>>> Additionally, I find the following two posts
>>> (1)http://stackoverflow.com/questions/29208844/apache-spark-
>>> logging-within-scala
>>> (2)https://www.mail-archive.com/user@spark.apache.org/msg29010.html
>>>
>>> I am wondering whether logging is an alternative approach to debugging
>>> spark anonymous functions.
>>>
>>>
>>> val count = spark.parallelize(1 to n, slices).map { i =>
>>>   val x = random * 2 - 1
>>>   val y = random * 2 - 1 (breakpoint-1 set in this line)
>>>   if (x*x + y*y < 1) 1 else 0
>>> }.reduce(_ + _)
>>> val test = x (breakpoint-2 set in this line)
>>>
>>>
>>>
>>> --
>>> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
>>> *Sent:* September 14, 2016, 23:32
>>> *To:* chen yong
>>> *Subject:* Re: Re: Re: t it does not stop at breakpoints which is in an
>>> anonymous function
>>>
>>> I don't know which IDE you use. I use IntelliJ, and here there is an
>>> Evaluate Expression dialog where I can execute code, whenever it has
>>> stopped in a breakpoint.
>>> In eclipse you have watch and inspect where you can do the same.
>>> Probably you are not seeing the debug stop in your functions because you
>>> never retrieve the data from your DataFrame/RDDs.
>>> What are you doing with this function? Are you getting the result of
>>> this RDD/Dataframe at some place?
>>> You can add a count after the function that you want to debug, just for
>>> debug, but don't forget to remove this after testing.
>>>
>>>
>>>
>>> 2016-09-14 12:20 GMT-03:00 chen yong <cy...@hotmail.com>:
>>>
>&g

Re: Re: Re: Re: Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
Hello Felix,
No, this line isn't the one that is triggering the execution of the
function, the count does that, unless your count val is a lazy val.
The count method is the one that retrieves the information of the RDD; it
has to go through all of its data to determine how many records the RDD
has.

Regards,

2016-09-15 22:23 GMT-03:00 chen yong <cy...@hotmail.com>:

>
> Dear Dirceu,
>
> Thanks for your kind help.
> i cannot see any code line corresponding to ". retrieve the data from
> your DataFrame/RDDs". which you suggested in the previous replies.
>
> Later, I guess
>
> the line
>
> val test = count
>
> is the key point. without it, it would not stop at the breakpont-1, right?
>
>
>
> --
> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
> *Sent:* September 16, 2016, 0:39
> *To:* chen yong
> *Cc:* user@spark.apache.org
> *Subject:* Re: Re: Re: Re: t it does not stop at breakpoints which is in an
> anonymous function
>
> Hi Felix,
> Are you sure your n is greater than 0?
> Here it stops first at breakpoint 1, image attached.
> Have you got the count to see if it's also greater than 0?
>
> 2016-09-15 11:41 GMT-03:00 chen yong <cy...@hotmail.com>:
>
>> Dear Dirceu
>>
>>
>> Thank you for your help.
>>
>>
>> Actually, I use IntelliJ IDEA to debug the Spark code.
>>
>>
>> Let me use the following code snippet to illustrate my problem. In the
>> code lines below, I've set two breakpoints, breakpoint-1 and breakpoint-2.
>> when i debuged the code, it did not stop at breakpoint-1, it seems that
>> the map
>>
>> function was skipped and it directly reached and stoped at the
>> breakpoint-2.
>>
>> Additionally, I find the following two posts
>> (1)http://stackoverflow.com/questions/29208844/apache-spark-
>> logging-within-scala
>> (2)https://www.mail-archive.com/user@spark.apache.org/msg29010.html
>>
>> I am wondering whether logging is an alternative approach to debugging
>> spark anonymous functions.
>>
>>
>> val count = spark.parallelize(1 to n, slices).map { i =>
>>   val x = random * 2 - 1
>>   val y = random * 2 - 1 (breakpoint-1 set in this line)
>>   if (x*x + y*y < 1) 1 else 0
>> }.reduce(_ + _)
>> val test = x (breakpoint-2 set in this line)
>>
>>
>>
>> --
>> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
>> *Sent:* September 14, 2016, 23:32
>> *To:* chen yong
>> *Subject:* Re: Re: Re: t it does not stop at breakpoints which is in an
>> anonymous function
>>
>> I don't know which IDE you use. I use IntelliJ, and here there is an
>> Evaluate Expression dialog where I can execute code, whenever it has
>> stopped in a breakpoint.
>> In eclipse you have watch and inspect where you can do the same.
>> Probably you are not seeing the debug stop in your functions because you
>> never retrieve the data from your DataFrame/RDDs.
>> What are you doing with this function? Are you getting the result of this
>> RDD/Dataframe at some place?
>> You can add a count after the function that you want to debug, just for
>> debug, but don't forget to remove this after testing.
>>
>>
>>
>> 2016-09-14 12:20 GMT-03:00 chen yong <cy...@hotmail.com>:
>>
>>> Dear Dirceu,
>>>
>>>
>>> thanks you again.
>>>
>>>
>>> Actually,I never saw it stopped at the breakpoints no matter how long I
>>> wait.  It just skipped the whole anonymous function to direactly reach
>>> the first breakpoint immediately after the anonymous function body. Is that
>>> normal? I suspect sth wrong in my debugging operations or settings. I am
>>> very new to spark and  scala.
>>>
>>>
>>> Additionally, please give me some detailed instructions about  "Some
>>> ides provide you a place where you can execute the code to see it's
>>> results". where is the PLACE
>>>
>>>
>>> your help badly needed!
>>>
>>>
>>> --
>>> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
>>> *Sent:* September 14, 2016, 23:07
>>> *To:* chen yong
>>> *Subject:* Re: Re: t it does not stop at breakpoints which is in an
>>> anonymous function
>>>
>>> You can call a count in the ide just to debug, or you can wait until it
>>> reaches the code, so you can debug.
>>> Some ides provide you a place where you can execut

Re: Re: Re: Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-15 Thread Dirceu Semighini Filho
Hi Felix,
Are you sure your n is greater than 0?
Here it stops first at breakpoint 1, image attached.
Have you got the count to see if it's also greater than 0?

2016-09-15 11:41 GMT-03:00 chen yong <cy...@hotmail.com>:

> Dear Dirceu
>
>
> Thank you for your help.
>
>
> Actually, I use IntelliJ IDEA to debug the Spark code.
>
>
> Let me use the following code snippet to illustrate my problem. In the
> code lines below, I've set two breakpoints, breakpoint-1 and breakpoint-2.
> When I debugged the code, it did not stop at breakpoint-1; it seems that
> the map
>
> function was skipped and it directly reached and stopped at
> breakpoint-2.
>
> Additionally, I find the following two posts
> (1)http://stackoverflow.com/questions/29208844/apache-
> spark-logging-within-scala
> (2)https://www.mail-archive.com/user@spark.apache.org/msg29010.html
>
> I am wondering whether logging is an alternative approach to debugging
> spark anonymous functions.
>
>
> val count = spark.parallelize(1 to n, slices).map { i =>
>   val x = random * 2 - 1
>   val y = random * 2 - 1 (breakpoint-1 set in this line)
>   if (x*x + y*y < 1) 1 else 0
> }.reduce(_ + _)
> val test = x (breakpoint-2 set in this line)
>
>
>
> --
> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
> *Sent:* September 14, 2016, 23:32
> *To:* chen yong
> *Subject:* Re: Re: Re: t it does not stop at breakpoints which is in an
> anonymous function
>
> I don't know which IDE you use. I use IntelliJ, and here there is an
> Evaluate Expression dialog where I can execute code, whenever it has
> stopped in a breakpoint.
> In eclipse you have watch and inspect where you can do the same.
> Probably you are not seeing the debug stop in your functions because you
> never retrieve the data from your DataFrame/RDDs.
> What are you doing with this function? Are you getting the result of this
> RDD/Dataframe at some place?
> You can add a count after the function that you want to debug, just for
> debug, but don't forget to remove this after testing.
>
>
>
> 2016-09-14 12:20 GMT-03:00 chen yong <cy...@hotmail.com>:
>
>> Dear Dirceu,
>>
>>
>> thanks you again.
>>
>>
>> Actually, I never saw it stop at the breakpoints no matter how long I
>> wait. It just skipped the whole anonymous function to directly reach
>> the first breakpoint immediately after the anonymous function body. Is that
>> normal? I suspect something is wrong in my debugging operations or settings.
>> I am very new to Spark and Scala.
>>
>>
>> Additionally, please give me some detailed instructions about  "Some
>> ides provide you a place where you can execute the code to see it's
>> results". where is the PLACE
>>
>>
>> your help badly needed!
>>
>>
>> --
>> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
>> *Sent:* September 14, 2016, 23:07
>> *To:* chen yong
>> *Subject:* Re: Re: t it does not stop at breakpoints which is in an anonymous
>> function
>>
>> You can call a count in the IDE just to debug, or you can wait until the job
>> reaches the code, so you can debug.
>> Some IDEs provide a place where you can execute the code to see its
>> results.
>> Be careful not to add these operations to your production code, because
>> they can slow down the execution of your code.
>>
>>
>>
>> 2016-09-14 11:43 GMT-03:00 chen yong <cy...@hotmail.com>:
>>
>>>
>>> Thanks for your reply.
>>>
>>> You mean I have to insert some code, such as x.count or
>>> x.collect, between the original Spark code lines to invoke some
>>> operations, right?
>>> But where are the right places to put my code lines?
>>>
>>> Felix
>>>
>>> --
>>> *From:* Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
>>> *Sent:* September 14, 2016, 22:33
>>> *To:* chen yong
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: t it does not stop at breakpoints which is in an anonymous
>>> function
>>>
>>> Hello Felix,
>>> Spark functions run lazily, and that's why it doesn't stop at those
>>> breakpoints.
>>> They will be executed only when you call some action methods on your
>>> DataFrame/RDD, like count, collect, ...
>>>
>>> Regards,
>>> Dirceu
>>>
>>> 2016-09-14 11:26 GMT-03:00 chen yong <cy...@hotmail.com>:
>>>
>>>> Hi all,
>>>>
>>>>
>>>>
>>>> I am newbie to spark. I am learning spark by debugging the spark code.
>>>> It is strange to me that it does not stop at breakpoints  which is
>>>> in  an anonymous function, it is normal in ordianry function, though.
>>>> It that normal. How to obverse variables in  an anonymous function.
>>>>
>>>>
>>>> Please help me. Thanks in advance!
>>>>
>>>>
>>>> Felix
>>>>
>>>
>>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread Dirceu Semighini Filho
Hello Felix,
Spark functions run lazily, and that's why it doesn't stop at those
breakpoints.
They will be executed only when you call some action methods on your
DataFrame/RDD, like count, collect, ...

Regards,
Dirceu
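
(A tiny illustration of the point above, not from the thread, assuming a SparkContext named sc:)

val mapped = sc.parallelize(1 to 10).map { i =>
  i * 2  // a breakpoint here is not hit yet: map is lazy
}
val total = mapped.reduce(_ + _)  // the action runs the map, so the breakpoint above is finally reached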

2016-09-14 11:26 GMT-03:00 chen yong :

> Hi all,
>
>
>
> I am a newbie to Spark. I am learning Spark by debugging the Spark code. It
> is strange to me that it does not stop at breakpoints which are
> in an anonymous function; it is normal in an ordinary function, though. Is
> that normal? How can I observe variables in an anonymous function?
>
>
> Please help me. Thanks in advance!
>
>
> Felix
>


Re: Forecasting algorithms in spark ML

2016-09-08 Thread Dirceu Semighini Filho
Hi Madabhattula Rajesh Kumar,
There is an open source project called sparkts
(Time Series for Spark) that implements ARIMA and Holt-Winters algorithms on
top of Spark, which can be used for forecasting. In some cases, Linear
Regression, which is available in MLlib, is suitable for forecasting.
In our packaging of Spark, we integrated sparkts and XGBoost. Feel free to
take a look there for reference: http://github.com/eleflow/uberdata

Kind Regards,
Dirceu


2016-09-08 4:18 GMT-03:00 Robin East :

> Spark's algorithms are summarised on this page (https://spark.apache.org/mllib/)
> and details are available from the MLlib user guide which is linked from the
> above URL
>
> Sent from my iPhone
>
> > On 8 Sep 2016, at 05:30, Madabhattula Rajesh Kumar 
> wrote:
> >
> > Hi,
> >
> > Please let me know supported Forecasting algorithms in spark ML
> >
> > Regards,
> > Rajesh
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Debug spark jobs on Intellij

2016-05-31 Thread Dirceu Semighini Filho
Try this:
Is this Java syntax? I'm not used to it, I'm used to Scala, so roughly:

val toDebug = rdd.mapPartitions { partition => // breakpoint stops here
  // by val toDebug I mean to assign the result to a variable; mapPartitions is a
  // lazy transformation (foreachPartition is an action returning Unit, so it
  // cannot be deferred and chained like this), so nothing has run yet
  partition.map { message =>
    // breakpoint doesn't stop here yet
    message
  }
}

*toDebug.first* // now is when this code will actually run
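One more note (this is my guess about why the breakpoint inside the partition
code is never hit, not something confirmed in this thread): the jdwp agent set
through SPARK_SUBMIT_OPTS only attaches to the driver JVM, while the code inside
foreachPartition/mapPartitions runs in the executor JVMs. A rough sketch of also
opening a debug port on the executors (the port number is an arbitrary example,
and with several executors per host you may get port conflicts):

bin/spark-submit \
  --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" \
  app-jar-with-dependencies.jar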


2016-05-31 17:59 GMT-03:00 Marcelo Oikawa :

>
>
>> Hi Marcelo, this is because the operations on an RDD are lazy; you will only
>> stop at the breakpoint inside the foreach when you call a first, a collect or
>> a reduce operation.
>>
>
> Isn't forEachRemaining a terminal method like first, collect or reduce?
> Anyway, I guess this is not the problem itself, because the code inside
> forEachRemaining runs fine but I can't debug this block.
>
>
>> This is when Spark will run the operations.
>> Have you tried that?
>>
>> Cheers.
>>
>> 2016-05-31 17:18 GMT-03:00 Marcelo Oikawa :
>>
>>> Hello, list.
>>>
>>> I'm trying to debug my spark application on Intellij IDE. Before I
>>> submit my job, I ran the command line:
>>>
>>> export
>>> SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
>>>
>>> after that:
>>>
>>> bin/spark-submit app-jar-with-dependencies.jar 
>>>
>>> The IDE connects with the running job but all code that is running on
>>> worker machine is unreachable to debug. See below:
>>>
>>> rdd.foreachPartition(partition -> { //breakpoint stop here
>>>
>>> partition.forEachRemaining(message -> {
>>>
>>> //breakpoint doesn't stop here
>>>
>>>  })
>>> });
>>>
>>> Does anyone know if it is possible? How? Any ideas?
>>>
>>>
>>>
>>
>


Re: Debug spark jobs on Intellij

2016-05-31 Thread Dirceu Semighini Filho
Hi Marcelo, this is because the operations on an RDD are lazy; you will only
stop at the breakpoint inside the foreach when you call a first, a collect or
a reduce operation.
That is when Spark will run the operations.
Have you tried that?

Cheers.

2016-05-31 17:18 GMT-03:00 Marcelo Oikawa :

> Hello, list.
>
> I'm trying to debug my spark application on Intellij IDE. Before I submit
> my job, I ran the command line:
>
> export
> SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
>
> after that:
>
> bin/spark-submit app-jar-with-dependencies.jar 
>
> The IDE connects with the running job but all code that is running on
> worker machine is unreachable to debug. See below:
>
> rdd.foreachPartition(partition -> { //breakpoint stop here
>
> partition.forEachRemaining(message -> {
>
> //breakpoint doesn't stop here
>
>  })
> });
>
> Does anyone know if it is possible? How? Any ideas?
>
>
>


Re: ClassNotFoundException in RDD.map

2016-03-23 Thread Dirceu Semighini Filho
Thanks Jakob,
I've looked into the source code and found that I was missing this property
there:
spark.repl.class.uri

Setting it solved the problem.

Cheers
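On Jakob's point about the element type, a small sketch (the predictions name
is made up here, and it assumes the id column really is a scala.math.BigDecimal):

import org.apache.spark.rdd.RDD

// instead of carrying the id as Any ...
val loose: RDD[(Any, Double)] = predictions

// ... keep the concrete type, so a wrong id type fails fast with a clear
// MatchError instead of slipping through until some later stage
val typed: RDD[(BigDecimal, Double)] = loose.map { case (id: BigDecimal, score) => (id, score) }
typed.map(_._1).take(10)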

2016-03-17 18:14 GMT-03:00 Jakob Odersky <ja...@odersky.com>:

> The error is very strange indeed, however without code that reproduces
> it, we can't really provide much help beyond speculation.
>
> One thing that stood out to me immediately is that you say you have an
> RDD of Any where every Any should be a BigDecimal, so why not specify
> that type information?
> When using Any, a whole class of errors, that normally the typechecker
> could catch, can slip through.
>
> On Thu, Mar 17, 2016 at 10:25 AM, Dirceu Semighini Filho
> <dirceu.semigh...@gmail.com> wrote:
> > Hi Ted, thanks for answering.
> > The map is just that; whatever I try inside the map throws this
> > ClassNotFoundException, even if I do map(f => f) it throws the exception.
> > What is bothering me is that when I do a take or a first it returns the
> > result, which makes me conclude that the previous code isn't wrong.
> >
> > Kind Regards,
> > Dirceu
> >
> >
> > 2016-03-17 12:50 GMT-03:00 Ted Yu <yuzhih...@gmail.com>:
> >>
> >> bq. $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
> >>
> >> Do you mind showing more of your code involving the map() ?
> >>
> >> On Thu, Mar 17, 2016 at 8:32 AM, Dirceu Semighini Filho
> >> <dirceu.semigh...@gmail.com> wrote:
> >>>
> >>> Hello,
> >>> I found a strange behavior after executing a prediction with MLlib.
> >>> My code returns an RDD[(Any,Double)] where Any is the id of my dataset,
> >>> which is a BigDecimal, and Double is the prediction for that line.
> >>> When I run
> >>> myRdd.take(10) it returns ok
> >>> res16: Array[_ >: (Double, Double) <: (Any, Double)] =
> >>> Array((1921821857196754403.00,0.1690292052496703),
> >>> (454575632374427.00,0.16902820241892452),
> >>> (989198096568001939.00,0.16903432789699502),
> >>> (14284129652106187990.00,0.16903517653451386),
> >>> (17980228074225252497.00,0.16903151028332508),
> >>> (3861345958263692781.00,0.16903056986183976),
> >>> (17558198701997383205.00,0.1690295450319745),
> >>> (10651576092054552310.00,0.1690286445174418),
> >>> (4534494349035056215.00,0.16903303401862327),
> >>> (5551671513234217935.00,0.16902303368995966))
> >>> But when I try to run some map on it:
> >>> myRdd.map(_._1).take(10)
> >>> It throws a ClassNotFoundException:
> >>> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0
> >>> in stage 72.0 failed 4 times, most recent failure: Lost task 0.3 in
> stage
> >>> 72.0 (TID 1774, 172.31.23.208): java.lang.ClassNotFoundException:
> >>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> >>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> >>> at java.security.AccessController.doPrivileged(Native Method)
> >>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> >>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> >>> at java.lang.Class.forName0(Native Method)
> >>> at java.lang.Class.forName(Class.java:278)
> >>> at
> >>>
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> >>> at
> >>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> >>> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> >>> at
> >>>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> >>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> >>> at
> >>>
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
> >>> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
> >>> at
> >>>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> >>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> >>> at
> >>>
> java.io.ObjectInputStream.defau

Re: Serialization issue with Spark

2016-03-23 Thread Dirceu Semighini Filho
Hello Hafsa,
A "Task not serializable" exception usually means that you are trying to use an
object, defined in the driver, in code that runs on the workers.
Can you post the code that is generating this error here, so we can better
advise you?

Cheers.
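For reference, the usual shape of the problem as a generic sketch (not Hafsa's
actual code; Connection is a made-up non-serializable class and rdd is assumed
to be an RDD[String]):

class Connection { def send(s: String): Unit = () } // not Serializable

val conn = new Connection()               // created on the driver
rdd.foreach(record => conn.send(record))  // the closure captures conn -> Task not serializable

// a common fix: create the non-serializable resource inside the task, once per partition
rdd.foreachPartition { records =>
  val conn = new Connection()             // never shipped from the driver
  records.foreach(record => conn.send(record))
}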

2016-03-23 14:14 GMT-03:00 Hafsa Asif :

> Can anyone please help me in this issue?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Serialization-issue-with-Spark-tp26565p26579.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


ClassNotFoundException in RDD.map

2016-03-20 Thread Dirceu Semighini Filho
Hello,
I found a strange behavior after executing a prediction with MLlib.
My code returns an RDD[(Any,Double)] where Any is the id of my dataset,
which is a BigDecimal, and Double is the prediction for that line.
When I run
myRdd.take(10) it returns ok
res16: Array[_ >: (Double, Double) <: (Any, Double)] =
Array((1921821857196754403.00,0.1690292052496703),
(454575632374427.00,0.16902820241892452),
(989198096568001939.00,0.16903432789699502),
(14284129652106187990.00,0.16903517653451386),
(17980228074225252497.00,0.16903151028332508),
(3861345958263692781.00,0.16903056986183976),
(17558198701997383205.00,0.1690295450319745),
(10651576092054552310.00,0.1690286445174418),
(4534494349035056215.00,0.16903303401862327),
(5551671513234217935.00,0.16902303368995966))
But when I try to run some map on it:
myRdd.map(_._1).take(10)
It throws a ClassNotFoundException:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 72.0 failed 4 times, most recent failure: Lost task 0.3 in stage
72.0 (TID 1774, 172.31.23.208): java.lang.ClassNotFoundException:
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 

Re: ClassNotFoundException in RDD.map

2016-03-19 Thread Dirceu Semighini Filho
Hi Ted, thanks for answering.
The map is just that; whatever I try inside the map throws this
ClassNotFoundException, even if I do map(f => f) it throws the exception.
What is bothering me is that when I do a take or a first it returns the
result, which makes me conclude that the previous code isn't wrong.

Kind Regards,
Dirceu

2016-03-17 12:50 GMT-03:00 Ted Yu <yuzhih...@gmail.com>:

> bq. $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
>
> Do you mind showing more of your code involving the map() ?
>
> On Thu, Mar 17, 2016 at 8:32 AM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hello,
>> I found a strange behavior after executing a prediction with MLlib.
>> My code returns an RDD[(Any,Double)] where Any is the id of my dataset,
>> which is a BigDecimal, and Double is the prediction for that line.
>> When I run
>> myRdd.take(10) it returns ok
>> res16: Array[_ >: (Double, Double) <: (Any, Double)] =
>> Array((1921821857196754403.00,0.1690292052496703),
>> (454575632374427.00,0.16902820241892452),
>> (989198096568001939.00,0.16903432789699502),
>> (14284129652106187990.00,0.16903517653451386),
>> (17980228074225252497.00,0.16903151028332508),
>> (3861345958263692781.00,0.16903056986183976),
>> (17558198701997383205.00,0.1690295450319745),
>> (10651576092054552310.00,0.1690286445174418),
>> (4534494349035056215.00,0.16903303401862327),
>> (5551671513234217935.00,0.16902303368995966))
>> But when I try to run some map on it:
>> myRdd.map(_._1).take(10)
>> It throws a ClassNotFoundException:
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in stage 72.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>> 72.0 (TID 1774, 172.31.23.208): java.lang.ClassNotFoundException:
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:278)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
>> at
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java

Re: SparkR Count vs Take performance

2016-03-02 Thread Dirceu Semighini Filho
Thanks Sun, this explains why I was getting so many jobs running; my RDDs
were empty.



2016-03-02 10:29 GMT-03:00 Sun, Rui <rui@intel.com>:

> This has nothing to do with object serialization/deserialization. It is
> expected behavior that take(1) most likely runs slower than count() on an
> empty RDD.
>
> It is all about the algorithm with which take() is implemented. take():
> 1. Reads one partition to get the elements.
> 2. If the fetched elements do not satisfy the limit, it estimates the
> number of additional partitions and fetches elements from them.
> take() repeats step 2 until it gets the desired number of elements or
> it has gone through all partitions.
>
> So take(1) on an empty RDD will go through all partitions in a sequential
> way.
>
> Compared with take(), count() also computes all partitions, but the
> computation is parallel on all partitions at once.
>
> The take() implementation in SparkR is less optimized than the one in Scala,
> as SparkR won't estimate the number of additional partitions but will read
> just one partition in each fetch.
>
> -Original Message-
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Wednesday, March 2, 2016 3:37 AM
> To: Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: SparkR Count vs Take performance
>
> Yeah one surprising result is that you can't call isEmpty on an RDD of
> nonserializable objects. You can't do much with an RDD of nonserializable
> objects anyway, but they can exist as an intermediate stage.
>
> We could fix that pretty easily with a little copy and paste of the
> take() code; right now isEmpty is simple but has this drawback.
>
> On Tue, Mar 1, 2016 at 7:18 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
> > Great, I hadn't noticed this isEmpty method.
> > Well, serialization has been a problem in this project; we have noticed
> > a lot of time being spent serializing and deserializing things to
> > send to and get from the cluster.
> >
> > 2016-03-01 15:47 GMT-03:00 Sean Owen <so...@cloudera.com>:
> >>
> >> There is an "isEmpty" method that basically does exactly what your
> >> second version does.
> >>
> >> I have seen it be unusually slow at times because it must copy 1
> >> element to the driver, and it's possible that's slow. It still
> >> shouldn't be slow in general, and I'd be surprised if it's slower
> >> than a count in all but pathological cases.
> >>
> >>
> >>
> >> On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho
> >> <dirceu.semigh...@gmail.com> wrote:
> >> > Hello all.
> >> > I have a script that creates a dataframe from this operation:
> >> >
> >> > mytable <- sql(sqlContext,("SELECT ID_PRODUCT, ... FROM mytable"))
> >> >
> >> > rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe)
> >> > dFrame <-
> >> > join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT)
> >> >
> >> > After filtering this dFrame with this:
> >> >
> >> >
> >> > I tried to execute the following
> >> > filteredDF <- filterRDD(toRDD(dFrame),function (row) {row['COLUMN']
> >> > %in% c("VALUES", ...)}) Now I need to know if the resulting
> >> > dataframe is empty, and to do that I tried these two options:
> >> > if(count(filteredDF) > 0)
> >> > and
> >> > if(length(take(filteredDF,1)) > 0)
> >> > I thought that the second one, using take, should run faster than
> >> > count, but that didn't happen.
> >> > The take operation creates one job per partition of my rdd (which
> >> > was
> >> > 200)
> >> > and this make it to run slower than the count.
> >> > Is this the expected behaviour?
> >> >
> >> > Regards,
> >> > Dirceu
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>
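For reference, a rough Scala sketch of the escalating take(n) strategy described
above (illustrative only, not the actual Spark or SparkR implementation):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def takeSketch[T: ClassTag](rdd: RDD[T], n: Int): Array[T] = {
  val sc = rdd.sparkContext
  val totalParts = rdd.partitions.length
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var partsScanned = 0
  var numPartsToTry = 1
  while (buf.size < n && partsScanned < totalParts) {
    // scan the next batch of partitions and collect up to the missing number of elements
    val parts = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
    val res = sc.runJob(rdd, (it: Iterator[T]) => it.take(n - buf.size).toArray, parts)
    res.foreach(buf ++= _)
    partsScanned += parts.size
    numPartsToTry *= 4 // grow the batch when the partitions scanned so far were (nearly) empty
  }
  buf.take(n).toArray
}

On an empty RDD this keeps scanning until all partitions have been visited, which
is why take(1) there can easily be slower than a single parallel count().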


Re: SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
Great, I hadn't noticed this isEmpty method.
Well, serialization has been a problem in this project; we have noticed a lot
of time being spent serializing and deserializing things to send to and get
from the cluster.

2016-03-01 15:47 GMT-03:00 Sean Owen <so...@cloudera.com>:

> There is an "isEmpty" method that basically does exactly what your
> second version does.
>
> I have seen it be unusually slow at times because it must copy 1
> element to the driver, and it's possible that's slow. It still
> shouldn't be slow in general, and I'd be surprised if it's slower than
> a count in all but pathological cases.
>
>
>
> On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho
> <dirceu.semigh...@gmail.com> wrote:
> > Hello all.
> > I have a script that creates a dataframe from this operation:
> >
> > mytable <- sql(sqlContext,("SELECT ID_PRODUCT, ... FROM mytable"))
> >
> > rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe)
> > dFrame <- join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT)
> >
> > After filtering this dFrame with this:
> >
> >
> > I tried to execute the following
> > filteredDF <- filterRDD(toRDD(dFrame),function (row) {row['COLUMN'] %in%
> > c("VALUES", ...)})
> > Now I need to know if the resulting dataframe is empty, and to do that I
> > tried these two options:
> > if(count(filteredDF) > 0)
> > and
> > if(length(take(filteredDF,1)) > 0)
> > I thought that the second one, using take, should run faster than count,
> but
> > that didn't happen.
> > The take operation creates one job per partition of my rdd (which was
> 200)
> > and this makes it run slower than the count.
> > Is this the expected behaviour?
> >
> > Regards,
> > Dirceu
>


SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
Hello all.
I have a script that creates a dataframe from this operation:

mytable <- sql(sqlContext,("SELECT ID_PRODUCT, ... FROM mytable"))

rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe)
dFrame <- join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT)

After filtering this dFrame with this:


I tried to execute the following
filteredDF <- filterRDD(toRDD(dFrame),function (row) {row['COLUMN'] %in%
c("VALUES", ...)})
Now I need to know if the resulting dataframe is empty, and to do that I
tried these two options:
if(count(filteredDF) > 0)
and
if(length(take(filteredDF,1)) > 0)
I thought that the second one, using take, should run faster than count,
but that didn't happen.
The take operation creates one job per partition of my rdd (which was 200)
and this makes it run slower than the count.
Is this the expected behaviour?

Regards,
Dirceu


Re: Client session timed out, have not heard from server in

2015-12-22 Thread Dirceu Semighini Filho
Hi Yash,
I've experienced this behavior here when the process freezes in a worker.
This mainly happens, in my case, when the worker memory was full and the
Java GC wasn't able to free memory for the process.
Try searching for OutOfMemory errors in your worker logs.

Regards,
Dirceu

2015-12-22 10:26 GMT-02:00 yaoxiaohua :

> Thanks for your reply.
>
> I find spark-env.sh :
>
> SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.akka.askTimeout=300
> -Dspark.ui.retainedStages=1000 -Dspark.eventLog.enabled=true
> -Dspark.eventLog.dir=hdfs://sparkcluster/user/spark_history_logs
> -Dspark.shuffle.spill=false -Dspark.shuffle.manager=hash
> -Dspark.yarn.max.executor.failures=9 -Dspark.worker.timeout=300"
>
>
>
> I just found a log entry like this:
>
>
> INFO ClientCnxn: Client session timed out, have not heard from server in
> 40015ms for sessionid 0x351c416297a145a, closing socket connection and
> attempting reconnect
>
> Before spark2 master process shut down.
>
> I don’t see any zookeeper timeout setting .
>
>
>
> Best
>
>
>
> *From:* Yash Sharma [mailto:yash...@gmail.com]
> *Sent:* December 22, 2015 19:55
> *To:* yaoxiaohua
> *Cc:* user@spark.apache.org
> *Subject:* Re: Client session timed out, have not heard from server in
>
>
>
> Hi Evan,
> SPARK-9629 referred to connection issues with zookeeper.  Could you check
> if its working fine in your setup.
>
> Also please share other error logs you might be getting.
>
> - Thanks, via mobile,  excuse brevity.
>
> On Dec 22, 2015 5:00 PM, "yaoxiaohua"  wrote:
>
> Hi,
>
> I encountered a similar issue with Spark 1.4.
>
> Master2 ran for some days, then got a timeout exception, then shut down.
>
> I found a bug:
>
> https://issues.apache.org/jira/browse/SPARK-9629
>
>
>
>
> INFO ClientCnxn: Client session timed out, have not heard from server in
> 40015ms for sessionid 0x351c416297a145a, closing socket connection and
> attempting reconnect
>
>
>
>
>
> could you tell me what do you do for this?
>
>
>
> Best Regards,
>
> Evan
>


Re: How to set memory for SparkR with master="local[*]"

2015-10-23 Thread Dirceu Semighini Filho
Hi Matej,
I'm also using this and I'm seeing the same behavior here; my driver has
only 530 MB, which is the default value.

Maybe this is a bug.

2015-10-23 9:43 GMT-02:00 Matej Holec :

> Hello!
>
> How to adjust the memory settings properly for SparkR with
> master="local[*]"
> in R?
>
>
> *When running from  R -- SparkR doesn't accept memory settings :(*
>
> I use the following commands:
>
> R>  library(SparkR)
> R>  sc <- sparkR.init(master = "local[*]", sparkEnvir =
> list(spark.driver.memory = "5g"))
>
> Although the variable spark.driver.memory is correctly set (checked in
> http://node:4040/environment/), the driver has only the default amount of
> memory allocated (Storage Memory 530.3 MB).
>
> *But when running from  spark-1.5.1-bin-hadoop2.6/bin/sparkR -- OK*
>
> The following command:
>
> ]$ spark-1.5.1-bin-hadoop2.6/bin/sparkR --driver-memory 5g
>
> creates a SparkR session with properly adjusted driver memory (Storage
> Memory 2.6 GB).
>
>
> Any suggestion?
>
> Thanks
> Matej
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-memory-for-SparkR-with-master-local-tp25178.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re:

2015-10-15 Thread Dirceu Semighini Filho
Hi Anfernee,
A subject in the email sometimes helps ;)
Have you checked whether the link is sending you to a hostname that is not
accessible from your workstation? Sometimes changing the hostname to the IP
solves this kind of issue.
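One other thing worth checking (a guess on my part, not something confirmed in
this thread): the history server can only show applications that wrote an event
log, and spark.history.fs.logDirectory configures the history server itself, not
the submitting application. A sketch of enabling event logging on the application
side, reusing the paths from your message:

val conf = new org.apache.spark.SparkConf()
  .setAppName("testSpak")
  .setMaster("yarn-client")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir",
    "hdfs://myHdfsNameNode:55310/scratch/tie/spark/applicationHistory")
  .set("spark.yarn.historyServer.address", "10.247.44.155:18080")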



2015-10-15 13:34 GMT-03:00 Anfernee Xu :

> Sorry, I have to re-send it again as I did not get the answer.
>
> Here's the problem I'm facing: I have a standalone Java application which
> periodically submits Spark jobs to my YARN cluster; btw, I'm not using
> 'spark-submit' or 'org.apache.spark.launcher' to submit my jobs. These jobs
> are successful and I can see them on the YARN RM web UI, but when I want to
> follow the link to the app history on the Spark history server, I always get
> a 404 (application not found) from the Spark history server.
>
> My code looks likes as below
>
>
> SparkConf conf = new
> SparkConf().setAppName("testSpak").setMaster("yarn-client")
> .setJars(new String[]{IOUtil.getJar(MySparkApp.class)});
>
> conf.set("spark.yarn.historyServer.address", "10.247.44.155:18080");
> conf.set("spark.history.fs.logDirectory",
> "
> hdfs://myHdfsNameNode:55310/scratch/tie/spark/applicationHistory");
>
> JavaSparkContext sc = new JavaSparkContext(conf);
>
> try {
>
>  ... my application code
>
> }finally{
>   sc.stop();
>
> }
>
> Did I do anything wrong or miss something? Do I need to configure something
> on the YARN side?
>
> Thanks
>
> --
> --Anfernee
>


Spark 1.5.1 ThriftServer

2015-10-15 Thread Dirceu Semighini Filho
Hello,
I'm trying to migrate to Scala 2.11 and I didn't find a spark-thriftserver
jar for Scala 2.11 in the Maven repository.
I could do a manual build (without tests) of Spark with the thrift server in
Scala 2.11.
Some time ago the thrift server build wasn't enabled by default, but I can
find a 2.10 jar for the thrift server.
Is there any problem with the thrift server in Scala 2.11?


Re: Null Value in DecimalType column of DataFrame

2015-09-18 Thread Dirceu Semighini Filho
Hi Yin, I got that part.
I just think that instead of returning null, throwing an exception would be
better. In the exception message we can explain that the DecimalType used
can't fit the number that is being converted, due to the precision and scale
values used to create it.
It would be easier for the user to find the reason why that error is
happening, instead of receiving a NullPointerException in another part of
his code. We could also improve the documentation of the DecimalType classes
to explain this behavior. What do you think?
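A small sketch of the precision/scale rule being discussed, for reference (this
follows Yin's description above; in this Spark version a value that does not fit
the target DecimalType comes back as null on cast):

import org.apache.spark.sql.types.DecimalType

val tight = DecimalType(10, 10) // 10 - 10 = 0 integer digits: only [0, 1) fits, so 10.5 becomes null
val wide  = DecimalType(12, 10) // 12 - 10 = 2 integer digits: 10.5 fits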




2015-09-17 18:52 GMT-03:00 Yin Huai <yh...@databricks.com>:

> As I mentioned before, the range of values of DecimalType(10, 10) is [0,
> 1). If you have a value 10.5 and you want to cast it to DecimalType(10,
> 10), I do not think there is any better returned value except of null.
> Looks like DecimalType(10, 10) is not the right type for your use case. You
> need a decimal type that has precision - scale >= 2.
>
> On Tue, Sep 15, 2015 at 6:39 AM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>>
>> Hi Yin, posted here because I think it's a bug.
> >> So, it will return null and I can get a NullPointerException, as I was
> >> getting. Is this really the expected behavior? I've never seen something
> >> returning null in other Scala tools that I've used.
>>
>> Regards,
>>
>>
>> 2015-09-14 18:54 GMT-03:00 Yin Huai <yh...@databricks.com>:
>>
>>> btw, move it to user list.
>>>
>>> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote:
>>>
>>>> A scale of 10 means that there are 10 digits at the right of the
>>>> decimal point. If you also have precision 10, the range of your data will
>>>> be [0, 1) and casting "10.5" to DecimalType(10, 10) will return null, which
>>>> is expected.
>>>>
>>>> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
>>>> dirceu.semigh...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>> I'm moving from spark 1.4 to 1.5, and one of my tests is failing.
>>>>> It seems that there was some changes in org.apache.spark.sql.types.
>>>>> DecimalType
>>>>>
>>>>> This ugly code is a little sample to reproduce the error, don't use it
>>>>> in your project.
>>>>>
>>>>> test("spark test") {
>>>>>   val file = 
>>>>> context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => 
>>>>> Row.fromSeq({
>>>>> val values = f.split(",")
>>>>> 
>>>>> Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>>>>> values.tail.tail.tail.head)}))
>>>>>
>>>>>   val structType = StructType(Seq(StructField("id", IntegerType, false),
>>>>> StructField("int2", IntegerType, false), StructField("double",
>>>>>
>>>>>  DecimalType(10,10), false),
>>>>>
>>>>>
>>>>> StructField("str2", StringType, false)))
>>>>>
>>>>>   val df = context.sqlContext.createDataFrame(file,structType)
>>>>>   df.first
>>>>> }
>>>>>
>>>>> The content of the file is:
>>>>>
>>>>> 1,5,10.5,va
>>>>> 2,1,0.1,vb
>>>>> 3,8,10.0,vc
>>>>>
>>>>> The problem resides in DecimalType; before 1.5 the scale wasn't
>>>>> required. Now, when using DecimalType(12,10) it works fine, but using
>>>>> DecimalType(10,10) the decimal value
>>>>> 10.5 becomes null, while 0.1 works.
>>>>>
>>>>> Is there anybody working with DecimalType for 1.5.1?
>>>>>
>>>>> Regards,
>>>>> Dirceu
>>>>>
>>>>>
>>>>
>>>
>>
>>
>