How to consider HTML files in Spark

2015-03-12 Thread yh18190
Hi. I am very much fascinated by the Spark framework. I am trying to use
PySpark + BeautifulSoup to parse HTML files, but I am having trouble loading
the HTML content into BeautifulSoup.

Example:

from bs4 import BeautifulSoup   # BeautifulSoup 4 import

filepath = "file:///path to html directory"

def readhtml(inputhtml):
    # load the HTML content into BeautifulSoup
    soup = BeautifulSoup(inputhtml)
    return soup

loaddata = sc.textFile(filepath).map(readhtml)

The problem is that Spark treats the loaded file as a text file and processes
it line by line. I want to load the entire HTML content into BeautifulSoup for
further processing. Does anyone have an idea how to take the whole HTML file
as input instead of processing it line by line?
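One possible direction, sketched below under the assumption of Spark 1.0 or
later: SparkContext.wholeTextFiles reads each file as a single (path, content)
record rather than line by line, so the whole document reaches the parser as
one string. PySpark exposes a method of the same name; the sketch below uses
the Scala API, and the path and parsing step are placeholders, not taken from
the original post.

// Each record is (filePath, fileContent); the content is the entire file,
// so a per-file HTML parser can consume it in a single call.
val htmlFiles = sc.wholeTextFiles("file:///path to html directory")

val parsed = htmlFiles.map { case (path, content) =>
  (path, content.length)   // replace with the real HTML parsing step
}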






Re: Unable to ship external Python libraries in PYSPARK

2014-10-07 Thread yh18190
Hi David,

Thanks for the reply and the effort you put into explaining the concepts.
Thanks for the example. It worked.






Unable to ship external Python libraries in PYSPARK

2014-09-12 Thread yh18190
Hi all,

I am currently working with PySpark for NLP processing, using the TextBlob
Python library. In standalone mode it is easy to install external Python
libraries, but in cluster mode I am having trouble installing these libraries
on the worker nodes remotely. I cannot access each and every worker machine to
install the libraries on its Python path. I tried the SparkContext pyFiles
option to ship .zip files, but the problem is that these Python packages need
to be installed on the worker machines. Could anyone let me know the different
ways of doing this, so that the TextBlob library is available on the Python
path?






Request for help in writing to Textfile

2014-08-25 Thread yh18190
Hi Guys,

I am currently playing with huge data. I have an RDD of type
RDD[List[(tuples)]], and I need only the tuples to be written to the text-file
output using the saveAsTextFile function.

Example: val mod = modify.saveAsTextFile() currently returns

List((20140813,4,141127,3,HYPHLJLU,HY,KNGHWEB,USD,144.00,662.40,KY1),
(20140813,4,141127,3,HYPHLJLU,HY,DBLHWEB,USD,144.00,662.40,KY1))
List((20140813,4,141127,3,HYPHLJLU,HY,KNGHWEB,USD,144.00,662.40,KY1),
(20140813,4,141127,3,HYPHLJLU,HY,DBLHWEB,USD,144.00,662.40,KY1))

I need the following output, with only the tuple values, in a text file:
20140813,4,141127,3,HYPHLJLU,HY,KNGHWEB,USD,144.00,662.40,KY1
20140813,4,141127,3,HYPHLJLU,HY,DBLHWEB,USD,144.00,662.40,KY1

Please let me know if anybody has any idea how to do this without using the
collect() function. Please help me.
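A minimal sketch of one way this could be done without collect(), assuming the
RDD is an RDD[List[TupleN]] as described; the output path below is
hypothetical.

// Flatten each List so every tuple becomes its own record, then render the
// tuple fields as a comma-separated line before writing.
val lines = modify
  .flatMap(list => list)                         // RDD of tuples
  .map(t => t.productIterator.mkString(","))     // "field1,field2,...,fieldN"

lines.saveAsTextFile("hdfs:///path/to/output")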






Request for Help

2014-08-25 Thread yh18190
Hi Guys,

I just want to know whether there is any way to determine which file is being
handled by Spark from a group of files given as input inside a directory.
Suppose I have 1000 input files; I want to determine which file is currently
being handled by the Spark program, so that if an error creeps in at any point
we can easily identify that particular file as the faulty one.

Please let me know your thoughts.
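A sketch of one possible approach, assuming Spark 1.0+ and files small enough
to be read one per record: SparkContext.wholeTextFiles keeps the source path
with each file's content, so a failure can be traced back to the file that
produced it. The path and processing step below are placeholders.

val files = sc.wholeTextFiles("hdfs:///path/to/input/dir")   // (path, content)

val results = files.map { case (path, content) =>
  try {
    (path, content.split("\n").length)    // stand-in for the real processing
  } catch {
    case e: Exception =>
      // The offending file is named explicitly in the error.
      throw new RuntimeException("Failed while processing " + path, e)
  }
}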






Is there a way to create a SparkContext object?

2014-05-12 Thread yh18190
Hi,

Could anyone suggest how we can create a SparkContext object in other classes
or functions where we need to convert a Scala collection to an RDD using the
sc object, e.g. sc.makeRDD(list), instead of using the main class's
SparkContext object?
Is there a way to pass the sc object as a parameter to functions in other
classes?
Please let me know.
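A minimal sketch of the parameter-passing approach; the class and method names
here are hypothetical. The SparkContext created in the main class can simply
be handed to other classes or functions as a constructor or method argument,
rather than being created again.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// A helper that receives the driver's SparkContext instead of creating one.
class ListConverter(sc: SparkContext) {
  def toRDD(list: List[Int]): RDD[Int] = sc.makeRDD(list)
}

object Main {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Simple App")
    val rdd = new ListConverter(sc).toRDD(List(1, 2, 3))
    println(rdd.count())
    sc.stop()
  }
}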





Regarding Partitioner

2014-04-16 Thread yh18190
Hi,

I have a large dataset of elements (an RDD) and I want to divide it into two
exactly equal-sized partitions while maintaining the order of the elements. I
tried using RangePartitioner, like

var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile))

but this does not give satisfactory results: it divides the data only roughly,
not into exactly equal sizes that preserve the order of the elements. For
example, with 64 elements, RangePartitioner splits them into 31 elements and
33 elements.

I need a partitioner such that the first 32 elements go into one half and the
second set of 32 elements into the other half. Could anyone help me by
suggesting how to use a custom partitioner so that I get two equally sized
halves while maintaining the order of the elements?

Please help me.
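A sketch of one possible custom-partitioner approach, assuming Spark 1.0+ (for
RDD.zipWithIndex); all names below are illustrative.

import org.apache.spark.Partitioner

// Send the first n/2 element indexes to partition 0 and the rest to partition 1.
class HalfPartitioner(n: Long) extends Partitioner {
  def numPartitions: Int = 2
  def getPartition(key: Any): Int =
    if (key.asInstanceOf[Long] < n / 2) 0 else 1
}

val n = partitionedFile.count()

// zipWithIndex assigns each element its position in the original order.
val halves = partitionedFile.zipWithIndex()   // (element, index)
  .map { case (v, i) => (i, v) }              // key by index
  .partitionBy(new HalfPartitioner(n))

// The shuffle may reorder elements inside each half, so restore the order
// from the carried index if it matters.
val ordered = halves.mapPartitions(
  it => it.toSeq.sortBy(_._1).iterator.map(_._2),
  preservesPartitioning = true)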






Problem with KryoSerializer

2014-04-15 Thread yh18190
Hi,

I have a problem when I try to use the Spark KryoSerializer by extending
KryoRegistrator to register custom classes. I am getting the following
exception when I run the program below. Please let me know what the problem
could be:
] (run-main) org.apache.spark.SparkException: Job failed:
java.io.NotSerializableException: main.scala.Utilities

Registering the classes (registrar):

package main.scala

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MykryoRegistrar extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[main.scala.Meter_data])
    kryo.register(classOf[main.scala.Utilities])
  }
}

MeterData_PerDay (main class):

import org.apache.spark.SparkContext

object MeterData_PerDay {

  def main(args: Array[String]) {

    System.setProperty("spark.serializer",
      "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator",
      "main.scala.MykryoRegistrar")

    val utilclass: Utilities = new Utilities()

    val sc = new SparkContext("local", "Simple App",
      utilclass.spark_home,
      List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))

    val file = sc.textFile(utilclass.data_home)
  }
}
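A hedged note on the exception, not taken from the thread: registering a class
with Kryo only affects how RDD data is serialized; task closures are still
serialized with Java serialization, so any object referenced from a closure
must be java.io.Serializable. If Utilities (or something holding it) ends up
inside a closure, one common fix is simply:

// Sketch only; the real fields of Utilities are not shown in the post.
class Utilities extends Serializable {
  val spark_home = "/path/to/spark"        // hypothetical values
  val data_home  = "hdfs:///path/to/data"
}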
 





Re: How to index each map operation????

2014-04-02 Thread yh18190
Hi Therry,

Thanks for the above responses. I implemented it using RangePartitioner; we
need to use one of the custom partitioners in order to perform this task.
Normally you cannot maintain a counter, because the count operations would
have to be performed on each partitioned block of data.





Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double])

2014-03-30 Thread yh18190
Hi,

Can we convert a Scala collection directly to a Spark RDD without using the
parallelize method?
Is there any way to create an RDD from a Scala collection type using some kind
of typecast?

Please suggest a way.
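For reference, a minimal sketch of the standard route: there is no implicit
cast from a local collection to an RDD; sc.parallelize (or its alias
sc.makeRDD) is the supported conversion.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD

val buf = ArrayBuffer((1, 0.5), (2, 1.5), (3, 2.5))

// ArrayBuffer is a Seq, so it can be handed to parallelize directly.
val rdd: RDD[(Int, Double)] = sc.parallelize(buf)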





Zip or map elements to create new RDD

2014-03-29 Thread yh18190
Hi,
I have an RDD of elements and want to create a new RDD by zipping it with
another RDD in order.
result is an RDD with the sequence of elements 10, 20, 30, 40, 50, ...
I am facing problems because the index is not an RDD, which gives an error.
Could anyone help me with how to zip or map it in order to obtain the
following result: (0,10), (1,20), (2,30), (3,40), ...
I tried the following, but it does not work; even zipWithIndex does not work,
because it is a Scala collection method, not an RDD method:

val index = List.range(0, result.count(), 1)
result.zip(index)
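A sketch of how this looks in later Spark releases (RDD.zipWithIndex was added
to the RDD API in Spark 1.0), which avoids building the index on the driver:

// zipWithIndex yields (element, index); swap the pair to get (index, element).
val indexed = result.zipWithIndex().map { case (v, i) => (i, v) }
// 10, 20, 30, 40, ...  becomes  (0,10), (1,20), (2,30), (3,40), ...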





Re: Zip or map elements to create new RDD

2014-03-29 Thread yh18190
Thanks, Sonal. Is there any other way to map values with increasing indexes,
so that I can do map(t => (i, t)) where the value of 'i' increases after each
map operation on an element?

Please help me with this.





How to index each map operation????

2014-03-29 Thread yh18190
Hi,

I want to perform a map operation on an RDD of elements such that the
resulting RDD is a key-value pair (counter, value).

For example, given var k: RDD[Int] = 10, 20, 30, 40, 40, 60, ...
I want k.map(t => (i, t)), where 'i' should act like a counter whose value
increments after each map operation.
Please help me.
I tried to write it like this, but it did not work out:

var i = 0
k.map(t => {
  (i, t); i += 1
})

Please correct me.
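A hedged note, not from the thread: the mutable counter fails because the
closure (including its copy of i) is serialized and shipped to each task, so
the increments happen on per-task copies. In Spark 1.0+ one alternative is
RDD.zipWithIndex:

// Each element is paired with its position in the RDD, computed per
// partition without any shared mutable counter.
val counted = k.zipWithIndex().map { case (t, i) => (i, t) }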





Re: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi,
Thanks, Nanzhu. I tried to implement your suggestion on the following
scenario: I have an RDD of, say, 24 elements, and when I partitioned it into
two groups of 12 elements each, the order of the elements within the
partitions was lost; the elements were partitioned randomly. I need to
preserve the order such that the first 12 elements end up in the first
partition and the second 12 elements in the second partition.
Please help me with how to maintain the order of the original sequence even
after partitioning. Any solution?
Before partition (RDD):
64
29186
16059
9143
6439
6155
9187
18416
25565
30420
33952
38302
43712
47092
48803
52687
56286
57471
63429
70715
75995
81878
80974
71288
48556
After partition, in group 1 with 12 elements:
64,
29186,
18416
30420
33952
38302
43712
47092
56286
81878
80974
71288
48556





RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread yh18190
Hi Andriana,

Thanks for the suggestion. Could you please modify the part of my code where I
need to do this? I apologise for the inconvenience; because I am new to Spark,
I could not apply it appropriately. I would be thankful to you.





Splitting RDD and Grouping together to perform computation

2014-03-24 Thread yh18190
Hi,

I have a large dataset of numbers as an RDD and want to perform a computation
on a group of two values at a time. For example, 1, 2, 3, 4, 5, 6, 7, ... is
an RDD. Can I group the RDD into (1,2), (3,4), (5,6), ... and perform the
respective computations in an efficient manner? Since we do not have a way to
index elements directly, as in a for loop over (i, i+1), is there a way to
resolve this problem? Please suggest something; I would be really thankful to
you.
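A sketch of one possible grouping approach, assuming Spark 1.0+, an even
number of elements, and an input RDD named numbers (illustrative); the index
is carried along so the order inside each pair is preserved:

// Key consecutive elements by i / 2 so (1,2), (3,4), ... share a key, then
// sort each small group by its original index before computing on it.
val pairs = numbers.zipWithIndex()
  .map { case (v, i) => (i / 2, (i, v)) }
  .groupByKey()
  .mapValues(vs => vs.toSeq.sortBy(_._1).map(_._2))   // e.g. List(1, 2)

// Example computation per pair: the sum of the two values.
val pairSums = pairs.mapValues(_.sum)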




Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread yh18190
We need someone who can explain this with a short code snippet on the given
example, so that we get a clear idea of RDD indexing.
Please help us.





Regarding Successive operation on elements and recursively

2014-03-18 Thread yh18190
Hi,
I am new to the Spark/Scala environment. Currently I am working on discrete
wavelet transformation algorithms for time-series data.
I have to perform recursive additions on successive elements in RDDs.
For example:
List of elements (RDD): a1 a2 a3 a4 ...
Level 1 transformation: a1+a2  a3+a4  a1-a2  a3-a4
Level 2: (a1+a2)+(a3+a4)  (a1+a2)-(a3+a4)

Is there a way to provide indexing of elements in a distributed environment
across nodes, so that I know I am referring to a2 after a1? I want to perform
successive addition of only two elements at a time, in a recursive manner.

Could you please help me with this? I would be really thankful to you.
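A sketch of one level of this pairwise step, assuming Spark 1.0+, an even
number of elements, and an RDD[Double] called signal (illustrative name);
repeating the step on the sums produces the next level:

// Pair consecutive elements via their index (i / 2 groups a1 with a2, etc.),
// keep the index so the order inside each pair is known, then emit the
// pairwise sum and difference.
val level1 = signal.zipWithIndex()
  .map { case (a, i) => (i / 2, (i, a)) }
  .groupByKey()
  .mapValues { vs =>
    val sorted = vs.toSeq.sortBy(_._1).map(_._2)
    val (a1, a2) = (sorted(0), sorted(1))
    (a1 + a2, a1 - a2)     // (sum, difference) for this pair
  }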


