static dataframe to streaming

2019-11-05 Thread aka.fe2s
Hi All,

What is the most efficient way of converting a static dataframe to a
streaming one (structured streaming)?
I have a custom sink implemented for structured streaming and I would like
to use it to write a static dataframe.
I know that I can write the dataframe to files and then source them to
create a stream. Is there a more efficient way to do this?
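
A rough sketch of the file-based workaround I mean, in case it helps; the
path, the staged dataframe name and the sink name are just placeholders:

// Sketch only: stage the static dataframe as parquet files, then read them
// back as a file-source stream and feed it to the custom sink.
staticDf.write.mode("overwrite").parquet("/tmp/staged-data")

val streamingDf = spark.readStream
  .schema(staticDf.schema)              // file sources need an explicit schema
  .parquet("/tmp/staged-data")

streamingDf.writeStream
  .format("my.custom.sink")             // placeholder for the custom sink
  .option("checkpointLocation", "/tmp/checkpoints")
  .start()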

--
Oleksiy


off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread aka.fe2s
Hi folks,

What has happened to Tachyon / Alluxio in Spark 2? The docs no longer
mention it.

--
Oleksiy Dyagilev


Re: How to write data into CouchBase using Spark & Scala?

2016-09-07 Thread aka.fe2s
Most likely you are missing an import statement that enables some Scala
implicits. I haven't used this connector, but it looks like you need "import
com.couchbase.spark._".
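
An untested sketch of what I would expect to work once the import is in
place, assuming the connector's implicits are what add saveToCouchbase()
to the RDD:

import com.couchbase.spark._   // brings the connector's implicit RDD functions into scope

val data = sc.parallelize(Seq(doc1, doc2))
data.saveToCouchbase()         // should now resolve via the connector's implicits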

--
Oleksiy Dyagilev

On Wed, Sep 7, 2016 at 9:42 AM, Devi P.V  wrote:

> I am a newbie in CouchBase. I am trying to write data into CouchBase. My
> sample code is as follows:
>
> val cfg = new SparkConf()
>   .setAppName("couchbaseQuickstart")
>   .setMaster("local[*]")
>   .set("com.couchbase.bucket.MyBucket","pwd")
>
> val sc = new SparkContext(cfg)
> val doc1 = JsonDocument.create("doc1", JsonObject.create().put("some", "content"))
> val doc2 = JsonArrayDocument.create("doc2", JsonArray.from("more", "content", "in", "here"))
>
> val data = sc
>   .parallelize(Seq(doc1, doc2))
>
> But I can't access data.saveToCouchbase().
>
> I am using Spark 1.6.1 & Scala 2.11.8
>
> I added the following dependencies in build.sbt:
>
> libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.6.1"
> libraryDependencies += "com.couchbase.client" % "spark-connector_2.11" % "1.2.1"
>
>
> How can I write data into CouchBase using Spark & Scala?


Re: LabeledPoint creation

2016-09-07 Thread aka.fe2s
It has 4 categories, but OneHotEncoder drops the last category by default,
so one of them (b here) is encoded as the all-zeros vector:
a = 1 0 0
b = 0 0 0
c = 0 1 0
d = 0 0 1
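
If you want one feature per category (4 here), a minimal sketch of keeping
the last category as well; I am assuming setDropLast is available in your
Spark version:

// Keep all 4 categories instead of dropping the last one (the default).
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false)   // assumption: available in your Spark version

val encoded = encoder.transform(indexed)
// a -> [1,0,0,0], b -> [0,0,0,1], etc.; positions follow the StringIndexer order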

--
Oleksiy Dyagilev

On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi,
>
> Any help on above mail use case ?
>
> Regards,
> Rajesh
>
> On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I am new to Spark ML and trying to create a LabeledPoint from a categorical
>> dataset (example code from Spark). For this, I am using the one-hot encoding
>> feature. Below is my code:
>>
>> val df = sparkSession.createDataFrame(Seq(
>>   (0, "a"),
>>   (1, "b"),
>>   (2, "c"),
>>   (3, "a"),
>>   (4, "a"),
>>   (5, "c"),
>>   (6, "d"))).toDF("id", "category")
>>
>> val indexer = new StringIndexer()
>>   .setInputCol("category")
>>   .setOutputCol("categoryIndex")
>>   .fit(df)
>>
>> val indexed = indexer.transform(df)
>>
>> indexed.select("category", "categoryIndex").show()
>>
>> val encoder = new OneHotEncoder()
>>   .setInputCol("categoryIndex")
>>   .setOutputCol("categoryVec")
>> val encoded = encoder.transform(indexed)
>>
>>  encoded.select("id", "category", "categoryVec").show()
>>
>> *Output :- *
>> +---+--------+-------------+
>> | id|category|  categoryVec|
>> +---+--------+-------------+
>> |  0|       a|(3,[0],[1.0])|
>> |  1|       b|    (3,[],[])|
>> |  2|       c|(3,[1],[1.0])|
>> |  3|       a|(3,[0],[1.0])|
>> |  4|       a|(3,[0],[1.0])|
>> |  5|       c|(3,[1],[1.0])|
>> |  6|       d|(3,[2],[1.0])|
>> +---+--------+-------------+
>>
>> *Creating LabeledPoint from the encoded dataframe:-*
>>
>> val data = encoded.rdd.map { x =>
>>   {
>> val featureVector = Vectors.dense(x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
>> val label = x.getAs[java.lang.Integer]("id").toDouble
>> LabeledPoint(label, featureVector)
>>   }
>> }
>>
>> data.foreach { x => println(x) }
>>
>> *Output :-*
>>
>> (0.0,[1.0,0.0,0.0])
>> (1.0,[0.0,0.0,0.0])
>> (2.0,[0.0,1.0,0.0])
>> (3.0,[1.0,0.0,0.0])
>> (4.0,[1.0,0.0,0.0])
>> (5.0,[0.0,1.0,0.0])
>> (6.0,[0.0,0.0,1.0])
>>
>> I have four categorical values (a, b, c, d), so I am expecting 4
>> features in the above LabeledPoint, but it has only 3 features.
>>
>> Please help me with the creation of a LabeledPoint from categorical features.
>>
>> Regards,
>> Rajesh
>>
>>
>>
>


Re: ml and mllib persistence

2016-07-12 Thread aka.fe2s
Okay, I think I found an answer to my question. Some models (for instance
org.apache.spark.mllib.recommendation.MatrixFactorizationModel) hold RDDs,
so just serializing these objects with plain Java serialization will not work.
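
So for a custom storage the practical route seems to be the model's own
save/load and then copying the produced directory around; a minimal sketch,
assuming a MatrixFactorizationModel and a placeholder path:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// The model writes its RDD-backed factors to a path (HDFS, S3, local, ...),
// so the persisted form is a directory of files rather than a single blob.
model.save(sc, "hdfs:///models/als-v1")

// Later, possibly from a different application:
val restored = MatrixFactorizationModel.load(sc, "hdfs:///models/als-v1")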

--
Oleksiy Dyagilev

On Tue, Jul 12, 2016 at 5:40 PM, aka.fe2s <aka.f...@gmail.com> wrote:

> What is the reason Spark has individual implementations of read/write
> routines for every model in mllib and ml? (Saveable and MLWritable trait
> impls)
>
> Wouldn't a generic implementation via Java serialization mechanism work? I
> would like to use it to store the models to a custom storage.
>
> --
> Oleksiy
>


Re: location of a partition in the cluster/ how parallelize method distribute the RDD partitions over the cluster.

2016-07-12 Thread aka.fe2s
The local collection is distributed across the cluster only when you run an
action (http://spark.apache.org/docs/latest/programming-guide.html#actions),
due to the laziness of RDDs.

If you want to control how the collection is split into partitions, you
can create your own RDD implementation and put this logic in the
getPartitions/compute methods. See ParallelCollectionRDD as a reference.
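
A very rough sketch of what such a custom RDD could look like; the slicing
logic here is just a placeholder:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// A partition that carries its own slice of the data.
class MyPartition[T](override val index: Int, val values: Seq[T]) extends Partition

class MyCollectionRDD[T: ClassTag](sc: SparkContext, data: Seq[T], numSlices: Int)
  extends RDD[T](sc, Nil) {

  // Decide here how the collection is split; this is where your custom logic goes.
  override protected def getPartitions: Array[Partition] =
    data.grouped(math.max(1, data.size / numSlices)).zipWithIndex
      .map { case (slice, i) => new MyPartition(i, slice): Partition }
      .toArray

  // Produce the records of a single partition when a task runs on an executor.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    split.asInstanceOf[MyPartition[T]].values.iterator

  // To influence where a partition is scheduled, you can additionally override
  // getPreferredLocations(split) and return hostnames.
}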

--
Oleksiy Dyagilev

On Sun, Jul 10, 2016 at 3:58 PM, Mazen  wrote:

> Hi,
>
> Any hint about getting the location of a particular RDD partition on the
> cluster? a workaround?
>
>
> The parallelize method partitions a collection into splits, as specified or
> per the default parallelism configuration. Does parallelize actually
> distribute the partitions across the cluster, or are the partitions kept on
> the driver node? In the first case, is there a protocol for assigning/mapping
> partitions (ParallelCollectionPartition) to workers, or is it just random?
> Otherwise, when are the partitions distributed across the cluster? Is that
> when tasks are launched on partitions?
>
> thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/location-of-a-partition-in-the-cluster-how-parallelize-method-distribute-the-RDD-partitions-over-the-tp27316.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


ml and mllib persistence

2016-07-12 Thread aka.fe2s
What is the reason Spark has individual implementations of read/write
routines for every model in mllib and ml? (Saveable and MLWritable trait
impls)

Wouldn't a generic implementation via Java serialization mechanism work? I
would like to use it to store the models to a custom storage.

--
Oleksiy


Re: Reading Back a Cached RDD

2016-03-28 Thread aka.fe2s
Nick, what is your use-case?


On Thu, Mar 24, 2016 at 11:55 PM, Marco Colombo  wrote:

> You can persist off-heap, for example with Tachyon, now called Alluxio.
> Take a look at off-heap persistence.
>
> Regards
>
>
> On Thursday, March 24, 2016, Holden Karau  wrote:
>
>> Even checkpoint() is maybe not exactly what you want, since if reference
>> tracking is turned on it will get cleaned up once the original RDD is out
>> of scope and GC is triggered.
>> If you want to share persisted RDDs right now one way to do this is
>> sharing the same spark context (using something like the spark job server
>> or IBM Spark Kernel).
>>
>> On Thu, Mar 24, 2016 at 11:28 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Isn’t persist() only for reusing an RDD within an active application?
>>> Maybe checkpoint() is what you’re looking for instead?
>>> ​
>>>
>>> On Thu, Mar 24, 2016 at 2:02 PM Afshartous, Nick <
>>> nafshart...@turbine.com> wrote:
>>>

 Hi,


 After calling RDD.persist(), is it then possible to come back later and
 access the persisted RDD?

 Let's say, for instance, coming back and starting a new Spark shell
 session. How would one access the persisted RDD in the new shell session?


 Thanks,

 --

Nick

>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
> --
> Ing. Marco Colombo
>



--
Oleksiy Dyagilev


Re: HdfsWordCount only counts some of the words

2014-09-24 Thread aka.fe2s
I guess it's because this example is stateless, so it outputs counts only for
the records in each batch RDD. Take a look at the stateful word counter,
StatefulNetworkWordCount.scala.
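
A rough sketch of the stateful variant, along the lines of
StatefulNetworkWordCount (the checkpoint path, hdfsDir and the streaming
context ssc are placeholders):

// Keep a running total per word across batches with updateStateByKey;
// state tracking requires a checkpoint directory.
ssc.checkpoint("/tmp/streaming-checkpoint")        // placeholder path

val lines = ssc.textFileStream(hdfsDir)            // hdfsDir: placeholder
val words = lines.flatMap(_.split(" "))
val runningCounts = words.map(w => (w, 1)).updateStateByKey[Int] {
  (newCounts: Seq[Int], state: Option[Int]) =>
    Some(state.getOrElse(0) + newCounts.sum)
}
runningCounts.print()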

On Wed, Sep 24, 2014 at 4:29 AM, SK skrishna...@gmail.com wrote:


 I execute it as follows:

 $SPARK_HOME/bin/spark-submit   --master master url  --class
 org.apache.spark.examples.streaming.HdfsWordCount
 target/scala-2.10/spark_stream_examples-assembly-1.0.jar  hdfsdir

 After I start the job, I add a new test file in hdfsdir. It is a large text
 file which I will not be able to copy here, but it probably has at least
 100 distinct words. However, the streaming output has only about 5-6 words
 along with their counts, as follows. I then stop the job after some time.

 Time ...

 (word1, cnt1)
 (word2, cnt2)
 (word3, cnt3)
 (word4, cnt4)
 (word5, cnt5)

 Time ...

 Time ...




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/HdfsWordCount-only-counts-some-of-the-words-tp14929p14967.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




MLlib, what online(streaming) algorithms are available?

2014-09-23 Thread aka.fe2s
Hi,

I'm looking for available online ML algorithms (ones that improve the model
with new streaming data). The only one I found is linear regression.
Is there anything else implemented as part of MLlib?
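
For reference, the streaming linear regression I mean looks roughly like this
(a sketch only; the feature dimension and the input DStreams are placeholders):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

// Continuously update the model on a training stream and score a test stream.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))   // placeholder feature dimension

model.trainOn(trainingStream)            // trainingStream: DStream[LabeledPoint], assumed
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()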

Thanks, Oleksiy.