Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-24 Thread Sean Owen
Are you certain? It looks like it was correct in the release:

https://github.com/apache/spark/blob/v1.6.2/core/src/main/scala/org/apache/spark/package.scala



On Mon, Jul 25, 2016 at 12:33 AM, Ascot Moss  wrote:
> Hi,
>
> I am trying to upgrade Spark from 1.6.1 to 1.6.2. In the 1.6.2 spark-shell, I
> found the version is still displayed as 1.6.1.
>
> Is this a minor typo/bug?
>
> Regards
>
>
>
> ###
>
> Welcome to
>
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>       /_/
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unsubscribe

2016-07-24 Thread Uzi Hadad



unsubscribe

2016-07-24 Thread Uzi Hadad



Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Nick Pentreath
Good suggestion Krishna

One issue is that this doesn't work with TrainValidationSplit or
CrossValidator for parameter tuning. Hence my solution in the PR which
makes it work with the cross-validators.
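A minimal sketch of the df.na.drop() workaround Krishna describes below, assuming the usual spark.ml ALS column names (a "rating" label and a "prediction" output) and that predictions stands for alsModel.transform(testData); this is an illustration, not the fix in the PR:

import org.apache.spark.ml.evaluation.RegressionEvaluator

// Rows for users/items unseen during training carry NaN predictions; drop them first
val cleanPredictions = predictions.na.drop(Seq("prediction"))

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")

val rmse = evaluator.evaluate(cleanPredictions)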

On Mon, 25 Jul 2016 at 00:42, Krishna Sankar  wrote:

> Thanks Nick. I also ran into this issue.
> VG, One workaround is to drop the NaN from predictions (df.na.drop()) and
> then use the dataset for the evaluator. In real life, probably detect the
> NaN and recommend most popular on some window.
> HTH.
> Cheers
> 
>
> On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath  > wrote:
>
>> It seems likely that you're running into
>> https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
>> test dataset in the train/test split contains users or items that were not
>> in the training set. Hence the model doesn't have computed factors for
>> those ids, and ALS 'transform' currently returns NaN for those ids. This in
>> turn results in NaN for the evaluator result.
>>
>> I have a PR open on that issue that will hopefully address this soon.
>>
>>
>> On Sun, 24 Jul 2016 at 17:49 VG  wrote:
>>
>>> ping. Does anyone have any suggestions/advice for me?
>>> It would be really helpful.
>>>
>>> VG
>>> On Sun, Jul 24, 2016 at 12:19 AM, VG  wrote:
>>>
 Sean,

 I did this just to test the model. When I do a split of my data as
 training to 80% and test to be 20%

 I get a Root-mean-square error = NaN

 So I am wondering where I might be going wrong

 Regards,
 VG

 On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen  wrote:

> No, that's certainly not to be expected. ALS works by computing a much
> lower-rank representation of the input. It would not reproduce the
> input exactly, and you don't want it to -- this would be seriously
> overfit. This is why in general you don't evaluate a model on the
> training set.
>
> On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
> > I am trying to run ml.ALS to compute some recommendations.
> >
> > Just to test I am using the same dataset for training using ALSModel
> and for
> > predicting the results based on the model .
> >
> > When I evaluate the result using RegressionEvaluator I get a
> > Root-mean-square error = 1.5544064263236066
> >
> > I think this should be 0. Any suggestions on what might be going wrong?
> >
> > Regards,
> > Vipul
>


>


Re: K-means Evaluation metrics

2016-07-24 Thread Yanbo Liang
Spark MLlib KMeansModel provides "computeCost" function which return the
sum of squared distances of points to their nearest center as the k-means
cost on the given dataset.

Thanks
Yanbo
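A minimal sketch of the computeCost call mentioned above, using the RDD-based spark.mllib API (the input path, k and iteration count are placeholders, and sc is the usual shell SparkContext):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated feature vectors from a text file
val data = sc.textFile("data/kmeans_points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 3, 20)   // k = 3, maxIterations = 20

// Sum of squared distances of points to their nearest center (WSSSE); lower is tighter for a fixed k
val wssse = model.computeCost(data)
println(s"Within Set Sum of Squared Errors = $wssse")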

2016-07-24 17:30 GMT-07:00 janardhan shetty :

> Hi,
>
> I was trying to evaluate the k-means clustering prediction since the exact
> cluster numbers were provided beforehand for each data point.
> I just tried Error = (predicted cluster number - given cluster number) as a
> brute-force method.
>
> What evaluation metrics are available in Spark for validating and improving
> K-means clustering?
>


Re: Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread Yanbo Liang
You can refer this JIRA (https://issues.apache.org/jira/browse/SPARK-14501)
for porting spark.mllib.fpm to spark.ml.

Thanks
Yanbo
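Until that port lands, a minimal sketch of the existing RDD-based API in spark.mllib.fpm (the transactions, support and confidence thresholds below are illustrative):

import org.apache.spark.mllib.fpm.FPGrowth

// One Array[String] of items per transaction
val transactions = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("b", "c", "d")))

val model = new FPGrowth()
  .setMinSupport(0.5)
  .setNumPartitions(4)
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + " -> " + itemset.freq)
}

// Association rules derived from the frequent itemsets, filtered by minimum confidence
model.generateAssociationRules(0.8).collect().foreach { rule =>
  println(rule.antecedent.mkString(",") + " => " + rule.consequent.mkString(",") +
    " (confidence " + rule.confidence + ")")
}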

2016-07-24 11:18 GMT-07:00 janardhan shetty :

> Is there any implementation of FPGrowth and association rules for Spark
> DataFrames?
> We have it for RDDs, but are there any pointers for DataFrames?
>


where I can find spark-streaming-kafka for spark2.0

2016-07-24 Thread kevin
Hi all,
I tried to run the example org.apache.spark.examples.streaming.KafkaWordCount and I
got this error:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/streaming/kafka/KafkaUtils$
at
org.apache.spark.examples.streaming.KafkaWordCount$.main(KafkaWordCount.scala:57)
at
org.apache.spark.examples.streaming.KafkaWordCount.main(KafkaWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.streaming.kafka.KafkaUtils$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 11 more

So where can I find spark-streaming-kafka for Spark 2.0?
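In Spark 2.0 the Kafka connector was renamed and split into spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10, and it is not bundled with the main distribution or the examples assembly, so KafkaUtils has to be added to the classpath explicitly. A sketch, assuming Scala 2.11 and the 0.8 connector that KafkaWordCount uses (adjust the versions to your build):

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"

// or add it at submit time:
// spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 ...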


[Error] : Save dataframe to csv using Spark-csv in Spark 1.6

2016-07-24 Thread Divya Gehlot
Hi,
I am getting the error below when I am trying to save a dataframe using spark-csv:

>
> final_result_df.write.format("com.databricks.spark.csv").option("header","true").save(output_path)


java.lang.NoSuchMethodError:
> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
> at
> com.databricks.spark.csv.util.CompressionCodecs$.(CompressionCodecs.scala:29)
> at
> com.databricks.spark.csv.util.CompressionCodecs$.(CompressionCodecs.scala)
> at
> com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:189)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:97)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:102)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:104)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:106)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:108)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:110)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:112)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:114)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:116)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:118)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:120)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:122)
> at $iwC$$iwC$$iwC$$iwC.(:124)
> at $iwC$$iwC$$iwC.(:126)
> at $iwC$$iwC.(:128)
> at $iwC.(:130)
> at (:132)
> at .(:136)
> at .()
> at .(:7)
> at .()
> at $print()




*I used the same code with Spark 1.5 and never faced this issue before.*
Am I missing something?
I would really appreciate the help.


Thanks,
Divya
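Not a definitive diagnosis, but a NoSuchMethodError on scala.Predef$.$conforms usually indicates a Scala binary-version mismatch, e.g. a spark-csv jar built for Scala 2.11 used with a Spark 1.6 build on Scala 2.10 (or the other way round). A sketch of matching the artifact suffix to the cluster's Scala version (the version numbers are illustrative):

// build.sbt: %% picks the suffix that matches the project's scalaVersion
libraryDependencies += "com.databricks" %% "spark-csv" % "1.4.0"

// or at the shell, choosing the suffix explicitly:
// spark-shell --packages com.databricks:spark-csv_2.10:1.4.0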


Re: Bzip2 to Parquet format

2016-07-24 Thread Andrew Ehrlich
You can load the text with sc.textFile() to an RDD[String], then use .map() to 
convert it into an RDD[Row]. At this point you are ready to apply a schema. Use 
sqlContext.createDataFrame(rddOfRow, structType)

Here is an example on how to define the StructType (schema) that you will 
combine with the RDD[Row] to create a DataFrame.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType
 


Once you have the DataFrame, save it to Parquet with dataframe.write.parquet("/path") to 
create a Parquet file.

Reference for SQLContext / createDataFrame: 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
 




> On Jul 24, 2016, at 5:34 PM, janardhan shetty  wrote:
> 
> We have data in bz2 compression format. Are there any links on converting it to
> Parquet in Spark, and any performance benchmarks or study materials?
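A minimal end-to-end sketch of the flow Andrew describes, assuming a comma-separated two-column layout (the paths, delimiter and schema are placeholders; sc and sqlContext come from the shell):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// bzip2 files are splittable, so textFile can read them directly and in parallel
val lines = sc.textFile("hdfs:///data/input/*.bz2")

val rowRdd = lines.map(_.split(",")).map(fields => Row(fields(0), fields(1).trim.toInt))

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", IntegerType, nullable = true)))

val df = sqlContext.createDataFrame(rowRdd, schema)

// Write the result out as Parquet
df.write.parquet("hdfs:///data/output/parquet")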



Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Rohit Chaddha
Great, thanks to both of you. I was struggling with this issue as well.

-Rohit


On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar  wrote:

> Thanks Nick. I also ran into this issue.
> VG, One workaround is to drop the NaN from predictions (df.na.drop()) and
> then use the dataset for the evaluator. In real life, probably detect the
> NaN and recommend most popular on some window.
> HTH.
> Cheers
> 
>
> On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath  > wrote:
>
>> It seems likely that you're running into
>> https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
>> test dataset in the train/test split contains users or items that were not
>> in the training set. Hence the model doesn't have computed factors for
>> those ids, and ALS 'transform' currently returns NaN for those ids. This in
>> turn results in NaN for the evaluator result.
>>
>> I have a PR open on that issue that will hopefully address this soon.
>>
>>
>> On Sun, 24 Jul 2016 at 17:49 VG  wrote:
>>
>>> ping. Does anyone have any suggestions/advice for me?
>>> It would be really helpful.
>>>
>>> VG
>>> On Sun, Jul 24, 2016 at 12:19 AM, VG  wrote:
>>>
 Sean,

 I did this just to test the model. When I do a split of my data as
 training to 80% and test to be 20%

 I get a Root-mean-square error = NaN

 So I am wondering where I might be going wrong

 Regards,
 VG

 On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen  wrote:

> No, that's certainly not to be expected. ALS works by computing a much
> lower-rank representation of the input. It would not reproduce the
> input exactly, and you don't want it to -- this would be seriously
> overfit. This is why in general you don't evaluate a model on the
> training set.
>
> On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
> > I am trying to run ml.ALS to compute some recommendations.
> >
> > Just to test I am using the same dataset for training using ALSModel
> and for
> > predicting the results based on the model .
> >
> > When I evaluate the result using RegressionEvaluator I get a
> > Root-mean-square error = 1.5544064263236066
> >
> > I think this should be 0. Any suggestions on what might be going wrong?
> >
> > Regards,
> > Vipul
>


>


Re: Size exceeds Integer.MAX_VALUE

2016-07-24 Thread Andrew Ehrlich
You can use the .repartition() function on the rdd or dataframe to set 
the number of partitions higher. Use .partitions.length to get the current 
number of partitions. (Scala API).

Andrew
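A minimal sketch of the check-and-repartition step described above (trainingData stands for the caller's RDD; the target of 200 partitions is only illustrative):

// Current number of partitions
println(trainingData.partitions.length)

// Spread the data across more, smaller partitions so no cached/shuffled block approaches 2GB
val repartitioned = trainingData.repartition(200)
println(repartitioned.partitions.length)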

> On Jul 24, 2016, at 4:30 PM, Ascot Moss  wrote:
> 
> The data set is the training data for random forest training, about 
> 36,500 records. Any idea how to partition it further?  
> 
> On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich  > wrote:
> It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235 
>  which limits the size of 
> the blocks in the file being written to disk to 2GB.
> 
> If so, the solution is for you to try tuning for smaller tasks. Try 
> increasing the number of partitions, or using a more space-efficient data 
> structure inside the RDD, or increasing the amount of memory available to 
> spark and caching the data in memory. Make sure you are using Kryo 
> serialization. 
> 
> Andrew
> 
>> On Jul 23, 2016, at 9:00 PM, Ascot Moss > > wrote:
>> 
>> 
>> Hi,
>> 
>> Please help!
>> 
>> My spark: 1.6.2
>> Java: java8_u40
>> 
>> I am trying random forest training and I got "Size exceeds Integer.MAX_VALUE".
>> 
>> Any idea how to resolve it?
>> 
>> 
>> (the log) 
>> 16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 
>> 25)   
>> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE  
>> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836) 
>> at 
>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>> 
>> at 
>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>> 
>> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129) 
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136) 
>> at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)  
>>
>> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>>
>> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)   
>>
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) 
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)  
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)  
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)   
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)   
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)   
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   
>> at java.lang.Thread.run(Thread.java:745)
>> 16/07/24 07:59:49 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 25, 
>> localhost): java.lang.IllegalArgumentException: Size exceeds 
>> Integer.MAX_VALUE   
>> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836) 
>> at 
>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>> 
>> at 
>> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>> 
>> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129) 
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136) 
>> at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)  
>>
>> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>>
>> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)   
>>
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) 
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)  
>>
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)  
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)   
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)   
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)   
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at 
>> 

Bzip2 to Parquet format

2016-07-24 Thread janardhan shetty
We have data in bz2 compression format. Are there any links on converting it to
Parquet in Spark, and any performance benchmarks or study materials?


K-means Evaluation metrics

2016-07-24 Thread janardhan shetty
Hi,

I was trying to evaluate the k-means clustering prediction since the exact
cluster numbers were provided beforehand for each data point.
I just tried Error = (predicted cluster number - given cluster number) as a
brute-force method.

What evaluation metrics are available in Spark for validating and improving
K-means clustering?


Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Thanks Marco. This solved the order problem. I have another question which
precedes this one.

As you can see below, ID2, ID1 and ID3 are in order and I need to maintain
this index order as well. But when we do a groupByKey operation
(*rdd.distinct.groupByKey().mapValues(v => v.toArray)*),
everything gets *jumbled*.
Is there any way we can maintain this order as well?

scala> RDD.foreach(println)
(ID2,18159)
(ID1,18159)
(ID3,18159)

(ID2,18159)
(ID1,18159)
(ID3,18159)

(ID2,36318)
(ID1,36318)
(ID3,36318)

(ID2,54477)
(ID1,54477)
(ID3,54477)

*Jumbled version : *
Array(
(ID1,Array(*18159*, 308703, 72636, 64544, 39244, 107937, *54477*, 145272,
100079, *36318*, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
45431, 100136)),
(ID3,Array(100079, 19622, *18159*, 212064, 107937, 44683, 150022, 39244,
100136, 58866, 72636, 145272, 817, 89366, * 54477*, *36318*, 308703,
160992, 45431, 162076)),
(ID2,Array(308703, * 54477*, 89366, 39244, 150022, 72636, 817, 58866,
44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, *18159*,
45431, *36318*, 162076))
)

*Expected output:*
Array(
(ID1,Array(*18159*,*36318*, *54477,...*)),
(ID3,Array(*18159*,*36318*, *54477, ...*)),
(ID2,Array(*18159*,*36318*, *54477, ...*))
)

As you can see, after the *groupByKey* operation is complete, item 18159 is at
index 0 for ID1, index 2 for ID3 and index 16 for ID2, whereas the expected
index is 0 for all of them.
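One way to keep that positional order through the shuffle, sketched below under the assumption that rdd stands for the (ID, value) pair RDD printed above: tag each pair with its global index before grouping, then sort each group by that index (a distinct step, if needed, would have to happen before the tagging).

// Tag every (id, value) pair with its position in the original RDD
val indexed = rdd.zipWithIndex().map { case ((id, value), idx) => (id, (idx, value)) }

// Group by key, then restore the original order inside each group using the index
val orderedGroups = indexed
  .groupByKey()
  .mapValues(_.toArray.sortBy(_._1).map(_._2))

orderedGroups.collect().foreach { case (id, values) =>
  println(id + " -> " + values.mkString(","))
}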


On Sun, Jul 24, 2016 at 12:43 PM, Marco Mistroni 
wrote:

> Hello
>  Uhm you have an array containing 3 tuples?
> If all the arrays have the same length, you can just zip all of them,
> creating a list of tuples,
> then you can scan the list 5 by 5...?
>
> so something like
>
> (Array(0)._2, Array(1)._2, Array(2)._2).zipped.toList
>
> this will give you a list of tuples of 3 elements containing each items
> from ID1, ID2 and ID3  ... sample below
> res: List((18159,100079,308703), (308703, 19622, 54477), (72636,18159,
> 89366)..)
>
> then you can use a recursive function to compare each element such as
>
> def iterate(lst: List[(Int, Int, Int)]): Unit = {
>   if (lst.isEmpty) {
>     // return your comparison
>   } else {
>     val splits = lst.splitAt(5)
>     // do something about it using splits._1
>     iterate(splits._2)
>   }
> }
>
> will this help? or am i still missing something?
>
> kr
>
>
>
>
>
>
>
>
>
>
>
>
> On 24 Jul 2016 5:52 pm, "janardhan shetty"  wrote:
>
>> Array(
>> (ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272,
>> 100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
>> 45431, 100136)),
>> (ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022, 39244,
>> 100136, 58866, 72636, 145272, 817, 89366, 54477, 36318, 308703, 160992,
>> 45431, 162076)),
>> (ID2,Array(308703, 54477, 89366, 39244, 150022, 72636, 817, 58866, 44683,
>> 19622, 160992, 107937, 100079, 100136, 145272, 64544, 18159, 45431, 36318,
>> 162076))
>> )
>>
>> I need to compare the first 5 elements of ID1 with the first 5 elements of ID3,
>> then the first 5 elements of ID1 with ID2. Similarly for the next 5 elements, in
>> that order, until the end of the elements.
>> Let me know if this helps.
>>
>>
>> On Sun, Jul 24, 2016 at 7:45 AM, Marco Mistroni 
>> wrote:
>>
>>> Apologies I misinterpreted could you post two use cases?
>>> Kr
>>>
>>> On 24 Jul 2016 3:41 pm, "janardhan shetty" 
>>> wrote:
>>>
 Marco,

 Thanks for the response. It is indexed order and not ascending or
 descending order.
 On Jul 24, 2016 7:37 AM, "Marco Mistroni"  wrote:

> Use map values to transform to an rdd where values are sorted?
> Hth
>
> On 24 Jul 2016 6:23 am, "janardhan shetty" 
> wrote:
>
>> I have a key,value pair rdd where value is an array of Ints. I need
>> to maintain the order of the value in order to execute downstream
>> modifications. How do we maintain the order of values?
>> Ex:
>> rdd = (id1,[5,2,3,15],
>> Id2,[9,4,2,5])
>>
>> Followup question how do we compare between one element in rdd with
>> all other elements ?
>>
>
>>


Spark 1.6.2 version displayed as 1.6.1

2016-07-24 Thread Ascot Moss
Hi,

I am trying to upgrade Spark from 1.6.1 to 1.6.2. In the 1.6.2 spark-shell, I
found the version is still displayed as 1.6.1.

Is this a minor typo/bug?

Regards



###

Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/


Re: Size exceeds Integer.MAX_VALUE

2016-07-24 Thread Ascot Moss
The data set is the training data for random forest training, about
36,500 records. Any idea how to partition it further?

On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich 
wrote:

> It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235 which
> limits the size of the blocks in the file being written to disk to 2GB.
>
> If so, the solution is for you to try tuning for smaller tasks. Try
> increasing the number of partitions, or using a more space-efficient data
> structure inside the RDD, or increasing the amount of memory available to
> spark and caching the data in memory. Make sure you are using Kryo
> serialization.
>
> Andrew
>
> On Jul 23, 2016, at 9:00 PM, Ascot Moss  wrote:
>
>
> Hi,
>
> Please help!
>
> My spark: 1.6.2
> Java: java8_u40
>
> I am trying random forest training and I got "Size exceeds
> Integer.MAX_VALUE".
>
> Any idea how to resolve it?
>
>
> (the log)
> 16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID
> 25)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
> at
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>
> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
> 16/07/24 07:59:49 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 25,
> localhost): java.lang.IllegalArgumentException: Size exceeds
> Integer.MAX_VALUE
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>
> at
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
> at
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>
> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
> Regards
>
>
>


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Krishna Sankar
Thanks Nick. I also ran into this issue.
VG, One workaround is to drop the NaN from predictions (df.na.drop()) and
then use the dataset for the evaluator. In real life, probably detect the
NaN and recommend most popular on some window.
HTH.
Cheers


On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath 
wrote:

> It seems likely that you're running into
> https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
> test dataset in the train/test split contains users or items that were not
> in the training set. Hence the model doesn't have computed factors for
> those ids, and ALS 'transform' currently returns NaN for those ids. This in
> turn results in NaN for the evaluator result.
>
> I have a PR open on that issue that will hopefully address this soon.
>
>
> On Sun, 24 Jul 2016 at 17:49 VG  wrote:
>
>> ping. Does anyone have any suggestions/advice for me?
>> It would be really helpful.
>>
>> VG
>> On Sun, Jul 24, 2016 at 12:19 AM, VG  wrote:
>>
>>> Sean,
>>>
>>> I did this just to test the model. When I do a split of my data as
>>> training to 80% and test to be 20%
>>>
>>> I get a Root-mean-square error = NaN
>>>
>>> So I am wondering where I might be going wrong
>>>
>>> Regards,
>>> VG
>>>
>>> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen  wrote:
>>>
 No, that's certainly not to be expected. ALS works by computing a much
 lower-rank representation of the input. It would not reproduce the
 input exactly, and you don't want it to -- this would be seriously
 overfit. This is why in general you don't evaluate a model on the
 training set.

 On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
 > I am trying to run ml.ALS to compute some recommendations.
 >
 > Just to test I am using the same dataset for training using ALSModel
 and for
 > predicting the results based on the model .
 >
 > When I evaluate the result using RegressionEvaluator I get a
 > Root-mean-square error = 1.5544064263236066
 >
> I think this should be 0. Any suggestions on what might be going wrong?
 >
 > Regards,
 > Vipul

>>>
>>>


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Nick Pentreath
It seems likely that you're running into
https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
test dataset in the train/test split contains users or items that were not
in the training set. Hence the model doesn't have computed factors for
those ids, and ALS 'transform' currently returns NaN for those ids. This in
turn results in NaN for the evaluator result.

I have a PR open on that issue that will hopefully address this soon.


On Sun, 24 Jul 2016 at 17:49 VG  wrote:

> ping. Does anyone have any suggestions/advice for me?
> It would be really helpful.
>
> VG
> On Sun, Jul 24, 2016 at 12:19 AM, VG  wrote:
>
>> Sean,
>>
>> I did this just to test the model. When I do a split of my data as
>> training to 80% and test to be 20%
>>
>> I get a Root-mean-square error = NaN
>>
>> So I am wondering where I might be going wrong
>>
>> Regards,
>> VG
>>
>> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen  wrote:
>>
>>> No, that's certainly not to be expected. ALS works by computing a much
>>> lower-rank representation of the input. It would not reproduce the
>>> input exactly, and you don't want it to -- this would be seriously
>>> overfit. This is why in general you don't evaluate a model on the
>>> training set.
>>>
>>> On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
>>> > I am trying to run ml.ALS to compute some recommendations.
>>> >
>>> > Just to test I am using the same dataset for training using ALSModel
>>> and for
>>> > predicting the results based on the model .
>>> >
>>> > When I evaluate the result using RegressionEvaluator I get a
>>> > Root-mean-square error = 1.5544064263236066
>>> >
>>> > I think this should be 0. Any suggestions on what might be going wrong?
>>> >
>>> > Regards,
>>> > Vipul
>>>
>>
>>


Re: Maintaining order of pair rdd

2016-07-24 Thread Marco Mistroni
Hello
 Uhm you have an array containing 3 tuples?
If all the arrays have the same length, you can just zip all of them, creating
a list of tuples,
then you can scan the list 5 by 5...?

so something like

(Array(0)._2, Array(1)._2, Array(2)._2).zipped.toList

this will give you a list of tuples of 3 elements containing each items
from ID1, ID2 and ID3  ... sample below
res: List((18159,100079,308703), (308703, 19622, 54477), (72636,18159,
89366)..)

then you can use a recursive function to compare each element such as

def iterate(lst: List[(Int, Int, Int)]): Unit = {
  if (lst.isEmpty) {
    // return your comparison
  } else {
    val splits = lst.splitAt(5)
    // do something about it using splits._1
    iterate(splits._2)
  }
}

will this help? or am i still missing something?

kr











On 24 Jul 2016 5:52 pm, "janardhan shetty"  wrote:

> Array(
> (ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272,
> 100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
> 45431, 100136)),
> (ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022, 39244,
> 100136, 58866, 72636, 145272, 817, 89366, 54477, 36318, 308703, 160992,
> 45431, 162076)),
> (ID2,Array(308703, 54477, 89366, 39244, 150022, 72636, 817, 58866, 44683,
> 19622, 160992, 107937, 100079, 100136, 145272, 64544, 18159, 45431, 36318,
> 162076))
> )
>
> I need to compare the first 5 elements of ID1 with the first 5 elements of ID3,
> then the first 5 elements of ID1 with ID2. Similarly for the next 5 elements, in
> that order, until the end of the elements.
> Let me know if this helps.
>
>
> On Sun, Jul 24, 2016 at 7:45 AM, Marco Mistroni 
> wrote:
>
>> Apologies I misinterpreted could you post two use cases?
>> Kr
>>
>> On 24 Jul 2016 3:41 pm, "janardhan shetty" 
>> wrote:
>>
>>> Marco,
>>>
>>> Thanks for the response. It is indexed order and not ascending or
>>> descending order.
>>> On Jul 24, 2016 7:37 AM, "Marco Mistroni"  wrote:
>>>
 Use map values to transform to an rdd where values are sorted?
 Hth

 On 24 Jul 2016 6:23 am, "janardhan shetty" 
 wrote:

> I have a key,value pair rdd where value is an array of Ints. I need to
> maintain the order of the value in order to execute downstream
> modifications. How do we maintain the order of values?
> Ex:
> rdd = (id1,[5,2,3,15],
> Id2,[9,4,2,5])
>
> Followup question how do we compare between one element in rdd with
> all other elements ?
>

>


Outer Explode needed

2016-07-24 Thread Don Drake
I have a nested data structure (an array of structures) that I'm flattening with
the DSL df.explode() API. However, when the array is empty, I'm not getting the
rest of the row in my output, as it is skipped.

This is the intended behavior, and Hive supports a SQL "OUTER explode()" to
generate the row when the explode would not yield any output.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView

Can we get this same outer explode in the DSL?  I have to jump through some
outer join hoops to get the rows where the array is empty.

Thanks.

-Don

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143
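Until the DSL gets an outer explode, one workaround is to drop to the HiveQL syntax from the page linked above. A sketch, assuming sqlContext is a HiveContext and the DataFrame has an id column plus an array column named items:

// Empty or null arrays still produce a row, with item set to null
df.registerTempTable("events")

val exploded = sqlContext.sql(
  """SELECT id, item
    |FROM events
    |LATERAL VIEW OUTER explode(items) t AS item""".stripMargin)

exploded.show()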


Re: Spark driver getting out of memory

2016-07-24 Thread Raghava Mutharaju
Saurav,

We have the same issue. Our application runs fine on 32 nodes with 4 cores
each and 256 partitions but gives an OOM on the driver when run on 64 nodes
with 512 partitions. Did you get to know the reason behind this behavior or
the relation between number of partitions and driver RAM usage?

Regards,
Raghava.


On Wed, Jul 20, 2016 at 2:08 AM, Saurav Sinha 
wrote:

> Hi,
>
> I have set driver memory to 10 GB and the job ran with intermediate failures
> which were recovered by Spark.
>
> But I still want to know: if the number of partitions increases, does driver RAM
> need to be increased, and what is the ratio of number of partitions to RAM?
>
> @RK: I am using cache on the RDD. Is this the reason for the high RAM utilization?
>
> Thanks,
> Saurav Sinha
>
> On Tue, Jul 19, 2016 at 10:14 PM, RK Aduri 
> wrote:
>
>> Just want to see if this helps.
>>
>> Are you doing heavy collects and persist that? If that is so, you might
>> want to parallelize that collection by converting to an RDD.
>>
>> Thanks,
>> RK
>>
>> On Tue, Jul 19, 2016 at 12:09 AM, Saurav Sinha 
>> wrote:
>>
>>> Hi Mich,
>>>
>>>1. In what mode are you running the spark standalone, yarn-client,
>>>yarn cluster etc
>>>
>>> Ans: spark standalone
>>>
>>>1. You have 4 nodes with each executor having 10G. How many actual
>>>executors do you see in UI (Port 4040 by default)
>>>
>>> Ans: There are 4 executor on which am using 8 cores
>>> (--total-executor-core 32)
>>>
>>>1. What is master memory? Are you referring to diver memory? May be
>>>I am misunderstanding this
>>>
>>> Ans: Driver memory is set as --driver-memory 5g
>>>
>>>1. The only real correlation I see with the driver memory is when
>>>you are running in local mode where worker lives within JVM process that
>>>you start with spark-shell etc. In that case driver memory matters.
>>>However, it appears that you are running in another mode with 4 nodes?
>>>
>>> Ans: I am running my job via spark-submit, and on my worker (executor) nodes
>>> there is no OOM issue; it only happens on the driver app.
>>>
>>> Thanks,
>>> Saurav Sinha
>>>
>>> On Tue, Jul 19, 2016 at 2:42 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 can you please clarify:


1. In what mode are you running the spark standalone, yarn-client,
yarn cluster etc
2. You have 4 nodes with each executor having 10G. How many actual
executors do you see in UI (Port 4040 by default)
3. What is master memory? Are you referring to diver memory? May be
I am misunderstanding this
4. The only real correlation I see with the driver memory is when
you are running in local mode where worker lives within JVM process that
you start with spark-shell etc. In that case driver memory matters.
However, it appears that you are running in another mode with 4 nodes?

 Can you get a snapshot of your environment tab in UI and send the
 output please?

 HTH


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 18 July 2016 at 11:50, Saurav Sinha  wrote:

> I have set --driver-memory 5g. I need to understand whether, as the number of
> partitions increases, driver memory needs to be increased. What would be the
> best ratio of number of partitions to driver memory?
>
> On Mon, Jul 18, 2016 at 4:07 PM, Zhiliang Zhu 
> wrote:
>
>> Try to set --driver-memory xg, with x as large as can be set.
>>
>>
>> On Monday, July 18, 2016 6:31 PM, Saurav Sinha <
>> sauravsinh...@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> I am running spark job.
>>
>> Master memory - 5G
>> executor memory 10G (running on 4 nodes)
>>
>> My job is getting killed as the number of partitions increases to 20K.
>>
>> 16/07/18 14:53:13 INFO DAGScheduler: Got job 17 (foreachPartition at
>> WriteToKafka.java:45) with 13524 output partitions (allowLocal=false)
>> 16/07/18 14:53:13 INFO DAGScheduler: Final stage: ResultStage
>> 640(foreachPartition at WriteToKafka.java:45)
>> 16/07/18 14:53:13 INFO DAGScheduler: Parents of final stage:
>> List(ShuffleMapStage 518, ShuffleMapStage 639)
>> 16/07/18 14:53:23 INFO DAGScheduler: Missing parents: List()
>> 16/07/18 14:53:23 

Re: UDF to build a Vector?

2016-07-24 Thread Marco Mistroni
Hi
What is your source data? I am guessing a DataFrame of Integers, as you
are using a UDF.
So your DataFrame is then a bunch of Row[Integer]?
Below is a sample from one of my programs to predict Eurocup winners, going from
a DataFrame of Row[Double] to an RDD of LabeledPoint.
I'm not using a UDF to convert to a Vector; I never tried it, and anyone can
suggest a better way, as I don't think my approach is very good.
hth


val euroQualifierDataFrame = getDataSet(sqlContext, trainDataPath)  // this
is a DataFrame of Row[Double]
val vectorRdd = euroQualifierDataFrame.map(createVectorRDD) // to an RDD of
Seq[Double], the row is now a sequence of Doubles
val data = toLabeledPointsRDD(vectorRdd, 0) // the second parameter is to
identify which item in the Seq[Double] is the Label. output will be an
RDD[LabeledPoint]

def createVectorRDD(row:Row):Seq[Double] = {
row.toSeq.map(_.asInstanceOf[Number].doubleValue)
  }

 def createLabeledPoint(row:Seq[Double], targetFeatureIdx:Int) = {
val features = row.zipWithIndex.filter(tpl => tpl._2 !=
targetFeatureIdx).map(tpl => tpl._1)
val main = row(targetFeatureIdx)
LabeledPoint(main, Vectors.dense(features.toArray))
 }

  def toLabeledPointsRDD(rddData: RDD[Seq[Double]], targetFeatureIdx:Int) =
{

rddData.map(seq => createLabeledPoint(seq, targetFeatureIdx))
  }






On Sun, Jul 24, 2016 at 5:12 PM, Jean Georges Perrin  wrote:

>
> Hi,
>
> Here is my UDF that should build a VectorUDT. How do I actually make that
> the value is in the vector?
>
> package net.jgp.labs.spark.udf;
>
> import org.apache.spark.mllib.linalg.VectorUDT;
> import org.apache.spark.sql.api.java.UDF1;
>
> public class VectorBuilder implements UDF1<Integer, VectorUDT> {
> private static final long serialVersionUID = -2991355883253063841L;
>
> @Override
> public VectorUDT call(Integer t1) throws Exception {
> return new VectorUDT();
> }
>
> }
>
> i plan on having this used by a linear regression in ML...
>
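For what it's worth, a hedged sketch of how this is usually done in Scala: the UDF returns a Vector value built with Vectors.dense (Spark maps it to the vector SQL type through its UDT), rather than returning the VectorUDT type object itself. The df and column name below are placeholders taken from the related thread:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.udf

// Wrap a single numeric column into a one-element dense vector
val vectorBuilder = udf { value: Int => Vectors.dense(value.toDouble) }

val withFeatures = df.withColumn("features", vectorBuilder(df("C1")))
withFeatures.printSchema()   // ... |-- features: vector (nullable = true)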


Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread janardhan shetty
Is there any implementation of FPGrowth and association rules for Spark
DataFrames?
We have it for RDDs, but are there any pointers for DataFrames?


Re: spark context stop vs close

2016-07-24 Thread Sean Owen
I think this is about JavaSparkContext which implements the standard
Closeable interface for convenience. Both do exactly the same thing.

On Sun, Jul 24, 2016 at 6:27 PM, Jacek Laskowski  wrote:
> Hi,
>
> I can only find stop. Where did you find close?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Jul 23, 2016 at 3:11 PM, Mail.com  wrote:
>> Hi All,
>>
>> Where should we use SparkContext stop vs close? Should we stop the context
>> first and then close?
>>
>> Are there general guidelines around this? When I stop and later try to close, I
>> get an "RPC already closed" error.
>>
>> Thanks,
>> Pradeep
>>
>>
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to read content of hdfs files

2016-07-24 Thread Bhupendra Mishra
I have HDFS data in zip format which includes data, name and namesecondary
folders. The structure is pretty much like datanode, namenode and secondary
namenode. How do I read the content of the data?

It would be great if someone could suggest tips/steps.

Thanks


Re: spark context stop vs close

2016-07-24 Thread Jacek Laskowski
Hi,

I can only find stop. Where did you find close?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sat, Jul 23, 2016 at 3:11 PM, Mail.com  wrote:
> Hi All,
>
> Where should we use SparkContext stop vs close? Should we stop the context
> first and then close?
>
> Are there general guidelines around this? When I stop and later try to close, I
> get an "RPC already closed" error.
>
> Thanks,
> Pradeep
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 2.0.0 RC 5 -- java.lang.AssertionError: assertion failed: Block rdd_[*] is not locked for reading

2016-07-24 Thread Ameen Akel
Hello,

I'm working with Spark 2.0.0-rc5 on Mesos (v0.28.2) on a job with ~600
cores.  Every so often, depending on the task that I've run, I'll lose an
executor to an assertion.  Here's an example error:
java.lang.AssertionError: assertion failed: Block rdd_2659_0 is not locked
for reading

I've pasted the rest of the relevant failure from that node at the bottom
of this email.  As far as I can tell, this occurs when I apply a series of
transformations from RDD->Dataframe, but then don't generate an action in
rapid succession.
I'm not sure whether this is a bug, or if something is wrong with my Spark
configuration.  Has someone encountered this error before?

Thanks!
Ameen

---

16/07/24 09:08:25 INFO MemoryStore: Block rdd_2659_0 stored as values in
memory (estimated size 1269.8 KB, free 319.8 GB)
16/07/24 09:08:25 INFO CodeGenerator: Code generated in 10.279499 ms
16/07/24 09:08:25 INFO PythonRunner: Times: total = 31, boot = -14876, init
= 14906, finish = 1
16/07/24 09:08:25 WARN Executor: 1 block locks were not released by TID =
94279:
[rdd_2659_0]
16/07/24 09:08:25 ERROR Utils: Uncaught exception in thread stdout writer
for /usr/bin/python
java.lang.AssertionError: assertion failed: Block rdd_2659_0 is not locked
for reading
at scala.Predef$.assert(Predef.scala:170)
at
org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:294)
at
org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:628)
at
org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:435)
at
org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
at
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
Source)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
16/07/24 09:08:25 INFO Executor: Finished task 0.0 in stage 410.0 (TID
94279). 6414 bytes result sent to driver
16/07/24 09:08:25 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[stdout writer for /usr/bin/python,5,main]
java.lang.AssertionError: assertion failed: Block rdd_2659_0 is not locked
for reading
at scala.Predef$.assert(Predef.scala:170)
at
org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:294)
at
org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:628)
at
org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:435)
at
org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
at
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
Source)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
at
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
16/07/24 09:08:25 INFO DiskBlockManager: Shutdown hook called
16/07/24 09:08:25 INFO ShutdownHookManager: Shutdown hook called
16/07/24 09:08:25 INFO ShutdownHookManager: Deleting directory

Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Array(
(ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272,
100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
45431, 100136)),
(ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022, 39244,
100136, 58866, 72636, 145272, 817, 89366, 54477, 36318, 308703, 160992,
45431, 162076)),
(ID2,Array(308703, 54477, 89366, 39244, 150022, 72636, 817, 58866, 44683,
19622, 160992, 107937, 100079, 100136, 145272, 64544, 18159, 45431, 36318,
162076))
)

I need to compare the first 5 elements of ID1 with the first 5 elements of ID3,
then the first 5 elements of ID1 with ID2. Similarly for the next 5 elements, in
that order, until the end of the elements.
Let me know if this helps.


On Sun, Jul 24, 2016 at 7:45 AM, Marco Mistroni  wrote:

> Apologies I misinterpreted could you post two use cases?
> Kr
>
> On 24 Jul 2016 3:41 pm, "janardhan shetty"  wrote:
>
>> Marco,
>>
>> Thanks for the response. It is indexed order and not ascending or
>> descending order.
>> On Jul 24, 2016 7:37 AM, "Marco Mistroni"  wrote:
>>
>>> Use map values to transform to an rdd where values are sorted?
>>> Hth
>>>
>>> On 24 Jul 2016 6:23 am, "janardhan shetty" 
>>> wrote:
>>>
 I have a key,value pair rdd where value is an array of Ints. I need to
 maintain the order of the value in order to execute downstream
 modifications. How do we maintain the order of values?
 Ex:
 rdd = (id1,[5,2,3,15],
 Id2,[9,4,2,5])

 Followup question how do we compare between one element in rdd with all
 other elements ?

>>>


java.lang.RuntimeException: Unsupported type: vector

2016-07-24 Thread Jean Georges Perrin
I try to build a simple DataFrame that can be used for ML


SparkConf conf = new SparkConf().setAppName("Simple prediction from Text File").setMaster("local");
SparkContext sc = new SparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

sqlContext.udf().register("vectorBuilder", new VectorBuilder(), new VectorUDT());

String filename = "data/tuple-data-file.csv";
StructType schema = new StructType(new StructField[] {
        new StructField("C0", DataTypes.StringType, false, Metadata.empty()),
        new StructField("C1", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("features", new VectorUDT(), false, Metadata.empty()) });

DataFrame df = sqlContext.read().format("com.databricks.spark.csv")
        .schema(schema)
        .option("header", "false")
        .load(filename);
df = df.withColumn("label", df.col("C0")).drop("C0");
df = df.withColumn("value", df.col("C1")).drop("C1");
df.printSchema();
Returns:
root
 |-- features: vector (nullable = false)
 |-- label: string (nullable = false)
 |-- value: integer (nullable = false)
df.show();
Returns:

java.lang.RuntimeException: Unsupported type: vector
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
at 
com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
at 
com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/07/24 12:46:01 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
localhost): java.lang.RuntimeException: Unsupported type: vector
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
at 
com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
at 
com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
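The spark-csv data source can only cast CSV strings to primitive column types, so declaring a vector column in the read schema fails with the TypeCast error above. A hedged sketch of the usual alternative, in Scala: read only the primitive columns, then build the features column afterwards with VectorAssembler (the file and column names follow the snippet above):

import org.apache.spark.ml.feature.VectorAssembler

// Read the CSV with primitive columns only (no vector column in the schema)
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .load("data/tuple-data-file.csv")
  .toDF("label", "value")

// Assemble the numeric column(s) into a "features" vector column afterwards
val assembler = new VectorAssembler()
  .setInputCols(Array("value"))
  .setOutputCol("features")

val df = assembler.transform(raw)
df.printSchema()   // label: string, value: integer, features: vector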

UDF to build a Vector?

2016-07-24 Thread Jean Georges Perrin

Hi,

Here is my UDF that should build a VectorUDT. How do I actually make sure that the 
value is in the vector?

package net.jgp.labs.spark.udf;

import org.apache.spark.mllib.linalg.VectorUDT;
import org.apache.spark.sql.api.java.UDF1;

public class VectorBuilder implements UDF1<Integer, VectorUDT> {
private static final long serialVersionUID = -2991355883253063841L;

@Override
public VectorUDT call(Integer t1) throws Exception {
return new VectorUDT();
}

}

i plan on having this used by a linear regression in ML...

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread VG
ping. Does anyone have any suggestions/advice for me?
It would be really helpful.

VG

On Sun, Jul 24, 2016 at 12:19 AM, VG  wrote:

> Sean,
>
> I did this just to test the model. When I do a split of my data as
> training to 80% and test to be 20%
>
> I get a Root-mean-square error = NaN
>
> So I am wondering where I might be going wrong
>
> Regards,
> VG
>
> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen  wrote:
>
>> No, that's certainly not to be expected. ALS works by computing a much
>> lower-rank representation of the input. It would not reproduce the
>> input exactly, and you don't want it to -- this would be seriously
>> overfit. This is why in general you don't evaluate a model on the
>> training set.
>>
>> On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
>> > I am trying to run ml.ALS to compute some recommendations.
>> >
>> > Just to test I am using the same dataset for training using ALSModel
>> and for
>> > predicting the results based on the model .
>> >
>> > When I evaluate the result using RegressionEvaluator I get a
>> > Root-mean-square error = 1.5544064263236066
>> >
>> > I think this should be 0. Any suggestions on what might be going wrong?
>> >
>> > Regards,
>> > Vipul
>>
>
>


Restarting Spark Streaming Job periodically

2016-07-24 Thread Prashant verma
Hi All,
   I want to restart my Spark Streaming job periodically, every 15
minutes, using Java. Is it possible and, if yes, how should I proceed?


Thanks,
Prashant


Re: Maintaining order of pair rdd

2016-07-24 Thread Marco Mistroni
Apologies I misinterpreted could you post two use cases?
Kr

On 24 Jul 2016 3:41 pm, "janardhan shetty"  wrote:

> Marco,
>
> Thanks for the response. It is indexed order and not ascending or
> descending order.
> On Jul 24, 2016 7:37 AM, "Marco Mistroni"  wrote:
>
>> Use map values to transform to an rdd where values are sorted?
>> Hth
>>
>> On 24 Jul 2016 6:23 am, "janardhan shetty" 
>> wrote:
>>
>>> I have a key,value pair rdd where value is an array of Ints. I need to
>>> maintain the order of the value in order to execute downstream
>>> modifications. How do we maintain the order of values?
>>> Ex:
>>> rdd = (id1,[5,2,3,15],
>>> Id2,[9,4,2,5])
>>>
>>> Followup question how do we compare between one element in rdd with all
>>> other elements ?
>>>
>>


Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Marco,

Thanks for the response. It is indexed order and not ascending or
descending order.
On Jul 24, 2016 7:37 AM, "Marco Mistroni"  wrote:

> Use map values to transform to an rdd where values are sorted?
> Hth
>
> On 24 Jul 2016 6:23 am, "janardhan shetty"  wrote:
>
>> I have a key,value pair rdd where value is an array of Ints. I need to
>> maintain the order of the value in order to execute downstream
>> modifications. How do we maintain the order of values?
>> Ex:
>> rdd = (id1,[5,2,3,15],
>> Id2,[9,4,2,5])
>>
>> Followup question how do we compare between one element in rdd with all
>> other elements ?
>>
>


Re: How to generate a sequential key in rdd across executors

2016-07-24 Thread Marco Mistroni
Hi, how about creating an auto-increment column in HBase?
Hth

On 24 Jul 2016 3:53 am, "yeshwanth kumar"  wrote:

> Hi,
>
> I am doing a bulk load to HBase using Spark,
> in which I need to generate a sequential key for each record;
> the key should be sequential across all the executors.
>
> I tried zipWithIndex; it didn't work because it gives an index per
> executor, not across all executors.
>
> Looking for some suggestions.
>
>
> Thanks,
> -Yeshwanth
>


Re: Locality sensitive hashing

2016-07-24 Thread Yanbo Liang
Hi Janardhan,

Please refer the JIRA (https://issues.apache.org/jira/browse/SPARK-5992)
for the discussion about LSH.

Regards
Yanbo

2016-07-24 7:13 GMT-07:00 Karl Higley :

> Hi Janardhan,
>
> I collected some LSH papers while working on an RDD-based implementation.
> Links at the end of the README here:
> https://github.com/karlhigley/spark-neighbors
>
> Keep me posted on what you come up with!
>
> Best,
> Karl
>
> On Sun, Jul 24, 2016 at 9:54 AM janardhan shetty 
> wrote:
>
>> I was looking into implementing locality sensitive hashing on
>> DataFrames.
>> Any pointers for reference?
>>
>


Re: Locality sensitive hashing

2016-07-24 Thread Karl Higley
Hi Janardhan,

I collected some LSH papers while working on an RDD-based implementation.
Links at the end of the README here:
https://github.com/karlhigley/spark-neighbors

Keep me posted on what you come up with!

Best,
Karl

On Sun, Jul 24, 2016 at 9:54 AM janardhan shetty 
wrote:

> I was looking into implementing locality sensitive hashing on
> DataFrames.
> Any pointers for reference?
>


Locality sensitive hashing

2016-07-24 Thread janardhan shetty
I was looking into implementing locality sensitive hashing on DataFrames.
Any pointers for reference?
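
In the meantime, a rough Scala sketch of one common flavour, sign-random-projection LSH for cosine similarity, on a DataFrame (untested and not an official API; assumes Spark 2.0 ml vectors and a SparkSession named spark, and the toy data, dimension and signature length are arbitrary). Each row gets a short bit signature from a fixed set of random hyperplanes; rows whose signatures collide become candidate near-neighbours.

import scala.util.Random

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Toy data; in practice this would be the real DataFrame with a Vector column.
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0, 0.5, -0.2)),
  Tuple1(Vectors.dense(-1.0, 0.3, 0.8)),
  Tuple1(Vectors.dense(0.9, 0.6, -0.1))
)).toDF("features")

val dim = 3        // dimensionality of the vectors
val numBits = 16   // signature length, an arbitrary choice
val rng = new Random(42)

// One random hyperplane per signature bit.
val planes = Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))

// Bit i of the signature is the sign of the dot product with hyperplane i.
val signature = udf { v: Vector =>
  planes.map { p =>
    val dot = (0 until dim).map(i => v(i) * p(i)).sum
    if (dot >= 0.0) '1' else '0'
  }.mkString
}

// Rows landing in the same bucket are candidate near-neighbours.
val hashed = df.withColumn("lshBucket", signature(col("features")))
hashed.groupBy("lshBucket").count().show()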


Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Sorry for the wrong link; what you should refer to is jpmml-sparkml
(https://github.com/jpmml/jpmml-sparkml).

Thanks
Yanbo

2016-07-24 4:46 GMT-07:00 Yanbo Liang :

> Spark does not support exporting ML models to PMML currently. You can try
> the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark)
> package, which supports a subset of ML models.
>
> Thanks
> Yanbo
>
> 2016-07-20 11:14 GMT-07:00 Ajinkya Kale :
>
>> Just found Google dataproc has a preview of spark 2.0. Tried it and
>> save/load works! Thanks Shuai.
>> Followup question - is there a way to export the pyspark.ml models to
>> PMML ? If not, what is the best way to integrate the model for inference in
>> a production service ?
>>
>> On Tue, Jul 19, 2016 at 8:22 PM Ajinkya Kale 
>> wrote:
>>
>>> I am using google cloud dataproc which comes with spark 1.6.1. So
>>> upgrade is not really an option.
>>> No way / hack to save the models in spark 1.6.1 ?
>>>
>>> On Tue, Jul 19, 2016 at 8:13 PM Shuai Lin 
>>> wrote:
>>>
 It's added in not-released-yet 2.0.0 version.

 https://issues.apache.org/jira/browse/SPARK-13036
 https://github.com/apache/spark/commit/83302c3b

 so i guess you need to wait for 2.0 release (or use the current rc4).

 On Wed, Jul 20, 2016 at 6:54 AM, Ajinkya Kale 
 wrote:

> Is there a way to save a pyspark.ml.feature.PCA model ? I know mllib
> has that but mllib does not have PCA afaik. How do people do model
> persistence for inference using the pyspark ml models ? Did not find any
> documentation on model persistency for ml.
>
> --ajinkya
>


>


Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Spark does not support exporting ML models to PMML currently. You can try
the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark) package,
which supports a subset of ML models.

Thanks
Yanbo

2016-07-20 11:14 GMT-07:00 Ajinkya Kale :

> Just found Google dataproc has a preview of spark 2.0. Tried it and
> save/load works! Thanks Shuai.
> Followup question - is there a way to export the pyspark.ml models to
> PMML ? If not, what is the best way to integrate the model for inference in
> a production service ?
>
> On Tue, Jul 19, 2016 at 8:22 PM Ajinkya Kale 
> wrote:
>
>> I am using google cloud dataproc which comes with spark 1.6.1. So upgrade
>> is not really an option.
>> No way / hack to save the models in spark 1.6.1 ?
>>
>> On Tue, Jul 19, 2016 at 8:13 PM Shuai Lin  wrote:
>>
>>> It's added in not-released-yet 2.0.0 version.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-13036
>>> https://github.com/apache/spark/commit/83302c3b
>>>
>>> so i guess you need to wait for 2.0 release (or use the current rc4).
>>>
>>> On Wed, Jul 20, 2016 at 6:54 AM, Ajinkya Kale 
>>> wrote:
>>>
 Is there a way to save a pyspark.ml.feature.PCA model ? I know mllib
 has that but mllib does not have PCA afaik. How do people do model
 persistence for inference using the pyspark ml models ? Did not find any
 documentation on model persistency for ml.

 --ajinkya

>>>
>>>
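
For reference, this is roughly what the 2.0 persistence API looks like on the Scala side (a minimal sketch, untested; assumes a SparkSession named spark and uses a toy dataset). As far as I can tell, the pyspark counterpart added for 2.0 (see the SPARK-13036 link above) mirrors it with model.save(path) and PCAModel.load(path).

import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.5),
  Vectors.dense(4.0, 6.0, 0.0, 5.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0)
).map(Tuple1.apply)
val df = spark.createDataFrame(data).toDF("features")

val model: PCAModel = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(2)
  .fit(df)

// New in 2.0: ML model persistence.
model.write.overwrite().save("/tmp/pca-model")
val reloaded = PCAModel.load("/tmp/pca-model")
reloaded.transform(df).show()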


Re: Using flatMap on Dataframes with Spark 2.0

2016-07-24 Thread Julien Nauroy
Hi again, 

Just another strange behavior I stumbled upon. Can anybody reproduce it? 
Here's the code snippet in scala: 
var df1 = spark.read.parquet(fileName) 


df1 = df1.withColumn("newCol", df1.col("anyExistingCol")) 
df1.printSchema() // here newCol exists 
df1 = df1.flatMap(x => List(x)) 
df1.printSchema() // newCol has disappeared 

Any idea what I could be doing wrong? Why would newCol disappear? 


Cheers, 
Julien 
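
One plausible cause, to be taken with a grain of salt: if the implicit Row encoder is still the one built with RowEncoder(df.schema) on the freshly loaded DataFrame, it describes the schema before withColumn, so flatMap serializes rows without newCol. Rebuilding the encoder from the current schema just before the flatMap should keep the column; a minimal sketch of that variant (untested, reusing the fileName placeholder from the snippet above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}

var df1 = spark.read.parquet(fileName)
df1 = df1.withColumn("newCol", df1.col("anyExistingCol"))

// Derive the encoder from the schema as it is *after* withColumn,
// so the serializer knows about newCol.
implicit val enc: ExpressionEncoder[Row] = RowEncoder(df1.schema)

df1 = df1.flatMap(x => List(x))
df1.printSchema() // newCol should still be listed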



- Original Message -

From: "Julien Nauroy"  
To: "Sun Rui"  
Cc: user@spark.apache.org 
Sent: Saturday, July 23, 2016 23:39:08 
Subject: Re: Using flatMap on Dataframes with Spark 2.0 

Thanks, it works like a charm now! 

Not sure how I could have found it by myself though. 
Maybe the error message when you don't specify the encoder should point to 
RowEncoder. 


Cheers, 
Julien 

- Original Message -

From: "Sun Rui"  
To: "Julien Nauroy"  
Cc: user@spark.apache.org 
Sent: Saturday, July 23, 2016 16:27:43 
Subject: Re: Using flatMap on Dataframes with Spark 2.0 

You should use: 
import org.apache.spark.sql.Row 
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} 

val df = spark.read.parquet(fileName) 

implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema) 

val df1 = df.flatMap { x => List(x) } 



On Jul 23, 2016, at 22:01, Julien Nauroy < julien.nau...@u-psud.fr > wrote: 

Thanks for your quick reply. 

I've tried with this encoder: 
implicit def RowEncoder: org.apache.spark.sql.Encoder[Row] = 
org.apache.spark.sql.Encoders.kryo[Row] 
Using a suggestion from 
http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset-in-spark-1-6
 

How did you setup your encoder? 


- Original Message -

From: "Sun Rui" < sunrise_...@163.com > 
To: "Julien Nauroy" < julien.nau...@u-psud.fr > 
Cc: user@spark.apache.org 
Sent: Saturday, July 23, 2016 15:55:21 
Subject: Re: Using flatMap on Dataframes with Spark 2.0 

I gave it a try. The schema after flatMap is the same, which is expected. 

What’s your Row encoder? 



On Jul 23, 2016, at 20:36, Julien Nauroy < julien.nau...@u-psud.fr > wrote: 

Hi, 

I'm trying to call flatMap on a Dataframe with Spark 2.0 (rc5). 
The code is the following: 
var data = spark.read.parquet(fileName).flatMap(x => List(x)) 

Of course it's an overly simplified example, but the result is the same. 
The dataframe schema goes from this: 
root 
|-- field1: double (nullable = true) 
|-- field2: integer (nullable = true) 
(etc) 

to this: 
root 
|-- value: binary (nullable = true) 

Plus I have to provide an encoder for Row. 
I expect to get the same schema after calling flatMap. 
Any idea what I could be doing wrong? 


Best regards, 
Julien 















Re: How to generate a sequential key in rdd across executors

2016-07-24 Thread Pedro Rodriguez
If you can use a DataFrame then you could use rank plus a window function at
the expense of an extra sort. Do you have an example of zipWithIndex not
working? That seems surprising.
On Jul 23, 2016 10:24 PM, "Andrew Ehrlich"  wrote:

> It’s hard to do in a distributed system. Maybe try generating a meaningful
> key using a timestamp + hashed unique key fields in the record?
>
> > On Jul 23, 2016, at 7:53 PM, yeshwanth kumar 
> wrote:
> >
> > Hi,
> >
> > I am doing a bulk load to HBase using Spark,
> > in which I need to generate a sequential key for each record;
> > the key should be sequential across all the executors.
> >
> > I tried zipWithIndex; it didn't work because zipWithIndex gives an index
> per executor, not across all executors.
> >
> > looking for some suggestions.
> >
> >
> > Thanks,
> > -Yeshwanth
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
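
A small Scala sketch of the DataFrame route (untested; assumes an existing SQLContext named sqlContext and some column to order by, here a made-up "ts"). row_number() over a window with no partitionBy gives one consecutive numbering starting at 1, and Spark will warn that all rows move to a single partition, which is the extra sort/shuffle cost mentioned above.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val df = sqlContext.createDataFrame(Seq(
  ("a", 30L), ("b", 10L), ("c", 20L)
)).toDF("id", "ts")

// One global ordering; no partitionBy means a single-partition sort.
val w = Window.orderBy("ts")
val withSeq = df.withColumn("seq", row_number().over(w))

withSeq.show()
// b -> 1, c -> 2, a -> 3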


Re: Distributed Matrices - spark mllib

2016-07-24 Thread Yanbo Liang
Hi Gourav,

I cannot reproduce your problem. The following code snippet works well on
my local machine; you can try to verify it in your environment. Or could
you provide more information so that others can reproduce your problem?

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
l = [(1, 1, 10), (2, 2, 20), (3, 3, 30)]
df = sqlContext.createDataFrame(l, ['row', 'column', 'value'])
rdd = df.select('row', 'column', 'value').rdd.map(lambda row:
MatrixEntry(*row))
mat = CoordinateMatrix(rdd)
mat.entries.collect()

Thanks
Yanbo



2016-07-22 13:14 GMT-07:00 Gourav Sengupta :

> Hi,
>
> I had a sparse matrix and I wanted to add the value of a particular row
> which is identified by a particular number.
>
> from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
> mat =
> CoordinateMatrix(all_scores_df.select('ID_1','ID_2','value').rdd.map(lambda
> row: MatrixEntry(*row)))
>
>
> This gives me the number of rows and columns. But I am not able to extract
> the values and it always reports back the error:
>
> AttributeError: 'NoneType' object has no attribute 'setCallSite'
>
>
> Thanks and Regards,
>
> Gourav Sengupta
>
>


Re: NoClassDefFoundError with ZonedDateTime

2016-07-24 Thread Timur Shenkao
Which version of Java 8 do you use? AFAIK, it's recommended to use Java
1.8.0_66 or later.

On Fri, Jul 22, 2016 at 8:49 PM, Jacek Laskowski  wrote:

> On Fri, Jul 22, 2016 at 6:43 AM, Ted Yu  wrote:
> > You can use this command (assuming log aggregation is turned on):
> >
> > yarn logs --applicationId XX
>
> I don't think it's gonna work for an already-running application (and I
> wish I were mistaken since I needed it just yesterday); you have to
> fall back to the stderr of the ApplicationMaster in container 1.
>
> Jacek
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark, Scala, and DNA sequencing

2016-07-24 Thread Sean Owen
Also, you may be interested in GATK, built on Spark, for genomics:
https://github.com/broadinstitute/gatk


On Sun, Jul 24, 2016 at 7:56 AM, Ofir Manor  wrote:
> Hi James,
> BTW - if you are into analyzing DNA with Spark, you may also be interested
> in ADAM:
>https://github.com/bigdatagenomics/adam
> http://bdgenomics.org/
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
>
> On Fri, Jul 22, 2016 at 10:31 PM, James McCabe  wrote:
>>
>> Hi!
>>
>> I hope this may be of use/interest to someone:
>>
>> Spark, a Worked Example: Speeding Up DNA Sequencing
>>
>>
>> http://scala-bility.blogspot.nl/2016/07/spark-worked-example-speeding-up-dna.html
>>
>> James
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark, Scala, and DNA sequencing

2016-07-24 Thread Ofir Manor
Hi James,
BTW - if you are into analyzing DNA with Spark, you may also be interested
in ADAM:
   https://github.com/bigdatagenomics/adam
http://bdgenomics.org/

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Fri, Jul 22, 2016 at 10:31 PM, James McCabe  wrote:

> Hi!
>
> I hope this may be of use/interest to someone:
>
> Spark, a Worked Example: Speeding Up DNA Sequencing
>
>
> http://scala-bility.blogspot.nl/2016/07/spark-worked-example-speeding-up-dna.html
>
> James
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>