SparkSql newbie problems with nested selects

2014-07-13 Thread Andy Davidson
Hi, I am running into trouble with a nested query using Python. To try to debug it, I first wrote the query I want using sqlite3: select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count) from Frequency as freq, (select term, docid, count from Frequency) as

Re: SparkSql newbie problems with nested selects

2014-07-13 Thread Andy Davidson
(select term, docid, count from Frequency) freqTranspose where freq.term = freqTranspose.term group by freq.docid, freqTranspose.docid) Michael On Sun, Jul 13, 2014 at 12:43 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I am running into trouble
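The document-similarity query being debugged in this thread can be prototyped end to end with Python's built-in sqlite3 module before porting it to SparkSQL. A minimal sketch, using the Frequency(docid, term, count) layout from the thread; the sample rows are invented:

```python
import sqlite3

# Build a tiny in-memory Frequency(docid, term, count) table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Frequency (docid TEXT, term TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO Frequency VALUES (?, ?, ?)",
    [("d1", "spark", 2), ("d1", "sql", 1), ("d2", "spark", 3)],
)

# Self-join Frequency against a nested select of itself on term, summing
# the products of the counts per docid pair (a dot product of term vectors).
rows = conn.execute(
    """
    SELECT freq.docid, freqTranspose.docid,
           SUM(freq.count * freqTranspose.count)
    FROM Frequency AS freq,
         (SELECT term, docid, count FROM Frequency) AS freqTranspose
    WHERE freq.term = freqTranspose.term
    GROUP BY freq.docid, freqTranspose.docid
    """
).fetchall()
print(rows)
```

With the sample rows above, the (d1, d1) pair sums to 2*2 + 1*1 = 5 and the (d1, d2) pair to 2*3 = 6.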

how to report documentation bug?

2014-09-16 Thread Andy Davidson
http://spark.apache.org/docs/latest/quick-start.html#standalone-applications Click on the java tab. There is a bug in the maven section: <version>1.1.0-SNAPSHOT</version> should be <version>1.1.0</version>. Hope this helps Andy

spark-1.1.0-bin-hadoop2.4 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass

2014-09-17 Thread Andy Davidson
Hi, I am new to spark. I am trying to write a simple java program that processes tweets that were collected and stored in a file. I figured the simplest thing to do would be to convert the JSON string into a java map. When I submit my jar file I keep getting the following error

Re: spark-1.1.0-bin-hadoop2.4 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass

2014-09-18 Thread Andy Davidson
After lots of hacking I figured out how to resolve this problem. It is not a good solution: it severely cripples Jackson, but at least for now I am unblocked. 1) turn off annotations: mapper.configure(Feature.USE_ANNOTATIONS, false); 2) in Maven, set the Jackson dependencies as provided.
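The workaround described above — marking the Jackson 1.x artifacts as provided so the copy already on Spark's classpath wins over one bundled in the application jar — would look roughly like this in the application POM. The artifact id and version are assumptions based on the org.codehaus.jackson error in the original post:

```xml
<!-- Use the Jackson 1.x classes already on Spark's classpath instead of
     shading a conflicting copy into the application jar. -->
<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <version>1.9.13</version>
  <scope>provided</scope>
</dependency>
```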

RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Andy Davidson
Hi, I wrote a little java job to try to figure out how RDD pipe works. Below is my test shell script. If I turn debugging on in the script, I get output in my console. If debugging is turned off in the shell script, I do not see anything in my console. Is this a bug or a feature? I am running

spark-ec2 ERROR: Line magic function `%matplotlib` not found

2014-09-25 Thread Andy Davidson
Hi I am running into trouble using iPython notebook on my cluster. Use the following command to set the cluster up $ ./spark-ec2 --key-pair=$KEY_PAIR --identity-file=$KEY_FILE --region=$REGION --slaves=$NUM_SLAVES launch $CLUSTER_NAME On master I launch python as follows $

problem with spark-ec2 launch script Re: spark-ec2 ERROR: Line magic function `%matplotlib` not found

2014-09-26 Thread Andy Davidson
`%matplotlib` not found. Maybe you have Python 2.7 on the master but Python 2.6 in the cluster; you should upgrade python to 2.7 in the cluster, or use python 2.6 on the master by setting PYSPARK_PYTHON=python2.6 On Thu, Sep 25, 2014 at 5:11 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I am

Re: problem with spark-ec2 launch script Re: spark-ec2 ERROR: Line magic function `%matplotlib` not found

2014-09-26 Thread Andy Davidson
or not, but if you want to try manually upgrading Python on a cluster launched by spark-ec2, there are some instructions in the comments here https://issues.apache.org/jira/browse/SPARK-922 for doing so. Nick ​ On Fri, Sep 26, 2014 at 2:18 PM, Andy Davidson a...@santacruzintegration.com wrote

iPython notebook ec2 cluster matlabplot not found?

2014-09-27 Thread Andy Davidson
Hi, I am having a heck of a time trying to get python to work correctly on my cluster created using the spark-ec2 script. The following link was really helpful https://issues.apache.org/jira/browse/SPARK-922 I am still running into problems with matplotlib. (It works fine on my mac.) I can not

Re: iPython notebook ec2 cluster matlabplot not found?

2014-09-29 Thread Andy Davidson
, that will be problematic. Finally, there is an open pull request https://github.com/apache/spark/pull/2554 related to IPython that may be relevant, though I haven’t looked at it too closely. Nick ​ On Sat, Sep 27, 2014 at 7:33 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I am having

Re: iPython notebook ec2 cluster matlabplot not found?

2014-09-29 Thread Andy Davidson
it to the slaves, that will be problematic. Finally, there is an open pull request https://github.com/apache/spark/pull/2554 related to IPython that may be relevant, though I haven’t looked at it too closely. Nick ​ On Sat, Sep 27, 2014 at 7:33 PM, Andy Davidson

newbie system architecture problem, trouble using streaming and RDD.pipe()

2014-09-29 Thread Andy Davidson
Hello, I am trying to build a system that does a very simple calculation on a stream and displays the results in a graph that I want to update every second or so. I think I have a fundamental misunderstanding about how streams and rdd.pipe() work. I want to do the data visualization

how to get actual count from as long from JavaDStream ?

2014-09-30 Thread Andy Davidson
Hi, I have a simple streaming app. All I want to do is figure out how many lines I have received in the current mini batch. If numLines was a JavaRDD I could simply call count(). How do you do something similar in Streaming? Here is my pseudo code: JavaDStream<String> msg =

Re: how to get actual count from as long from JavaDStream ?

2014-09-30 Thread Andy Davidson
be changed for Java and probably the function argument syntax is wrong too, but hopefully there's enough there to help. Jon On Tue, Sep 30, 2014 at 3:42 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I have a simple streaming app. All I want to do is figure out how many lines I

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Andy Davidson
a DStream of RDDs. You want the count for each RDD. DStream.count() gives you exactly that: a DStream of Longs which are the counts of events in each mini batch. On Tue, Sep 30, 2014 at 8:42 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I have a simple streaming app. All I
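The reply above can be illustrated without Spark: if you model a DStream as a sequence of mini-batches, DStream.count() simply maps each batch to its size. A toy pure-Python sketch (the batch contents are invented):

```python
# Model a DStream as a sequence of mini-batches (each batch plays the
# role of one RDD in the stream).
batches = [["line1", "line2"], [], ["line3", "line4", "line5"]]

def dstream_count(stream):
    """Like DStream.count(): map each mini-batch to its element count,
    yielding one count per batch rather than a single scalar."""
    return [len(batch) for batch in stream]

counts = dstream_count(batches)
print(counts)  # one count per mini-batch -> [2, 0, 3]
```

This is why there is no single long to extract: the stream never ends, so the counts arrive as a stream too.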

can I think of JavaDStream foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Andy Davidson
Hi I am new to Spark Streaming. Can I think of JavaDStream foreachRDD() as being 'for each mini batch'? The java doc does not say much about this function. Here is the background. I am writing a little test program to figure out how to use streams. At some point I wanted to calculate an

Re: can I think of JavaDStream foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Andy Davidson
about since the logging will happen on a potentially remote receiver. I am not sure if this explains your observed behavior; it depends on what you were logging. On Wed, Oct 1, 2014 at 6:51 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I am new to Spark Streaming. Can I think

problem with user@spark.apache.org spam filter

2014-10-03 Thread Andy Davidson
Any idea why my email was returned with the following error message? Thanks Andy This is the mail system at host smtprelay06.hostedemail.com. I'm sorry to have to inform you that your message could not be delivered to one or more recipients. It's attached below. For further assistance,

bug with IPython notebook?

2014-10-07 Thread Andy Davidson
Hi I think I found a bug in the iPython notebook integration. I am not sure how to report it I am running spark-1.1.0-bin-hadoop2.4 on an AWS ec2 cluster. I start the cluster using the launch script provided by spark I start iPython notebook on my cluster master as follows and use an ssh tunnel

small bug in pyspark

2014-10-10 Thread Andy Davidson
Hi I am running spark on an ec2 cluster. I need to update python to 2.7. I have been following the directions on http://nbviewer.ipython.org/gist/JoshRosen/6856670 https://issues.apache.org/jira/browse/SPARK-922 I noticed that when I start a shell using pyspark, I correctly got python2.7, how

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-22 Thread Andy Davidson
On a related note, how are you submitting your job? I have a simple streaming proof of concept and noticed that everything runs on my master. I wonder if I do not have enough load for spark to push tasks to the slaves. Thanks Andy From: Daniel Mahler dmah...@gmail.com Date: Monday, October

java how to configure streaming.dstream.DStream<> saveAsTextFiles() to work with hdfs?

2015-10-24 Thread Andy Davidson
Hi, I am using spark streaming in Java. One of the problems I have is I need to save twitter status in JSON format as I receive them. When I run the following code on my local machine it works, however all the output files are created in the current directory of the driver program. Clearly not a

spark streaming 1.51. uses very old version of twitter4j

2015-10-21 Thread Andy Davidson
While digging around the spark source today I discovered it depends on version 3.0.3 of twitter4j. This version was released on Dec 2, 2012. I noticed that the current version is 4.0.4, released on 6/23/2015. I am not aware of any particular problems. Are there any plans to upgrade? What is

Re: newbie trouble submitting java app to AWS cluster I created using spark-ec2 script from spark-1.5.1-bin-hadoop2.6 distribution

2015-10-29 Thread Andy Davidson
Hi Robin and Sabarish, I figured out what the problem is. To submit my java app so that it runs in cluster mode (i.e. I can close my laptop and go home) I need to do the following: 1. make sure my jar file is available on all the slaves. Spark-submit will cause my driver to run on a slave; it will not
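A sketch of the submission steps described above as commands. The class name, jar name, and master hostname are placeholders; the /root/... paths match the spark-ec2 layout mentioned elsewhere in this archive:

```shell
# Put the application jar somewhere every slave can read it.
/root/ephemeral-hdfs/bin/hadoop fs -put myApp.jar /jars/myApp.jar

# Submit in cluster mode: the driver runs on a worker, so the laptop
# can disconnect once the submission is accepted.
/root/spark/bin/spark-submit \
  --class com.example.MyApp \
  --master spark://<master-public-dns>:7077 \
  --deploy-mode cluster \
  hdfs:///jars/myApp.jar
```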

newbie trouble submitting java app to AWS cluster I created using spark-ec2 script from spark-1.5.1-bin-hadoop2.6 distribution

2015-10-28 Thread Andy Davidson
Hi, I just created a new cluster using the spark-ec2 script from the spark-1.5.1-bin-hadoop2.6 distribution. The master and slaves seem to be up and running. I am having a heck of a time figuring out how to submit apps. As a test I compiled the sample JavaSparkPi example. I have copied my jar file to

Re: newbie trouble submitting java app to AWS cluster I created using spark-ec2 script from spark-1.5.1-bin-hadoop2.6 distribution

2015-10-28 Thread Andy Davidson
I forgot to mention: I do not have a preference for the cluster manager. I chose the spark-1.5.1-bin-hadoop2.6 distribution because I want to use HDFS. I assumed this distribution would use YARN. Thanks Andy From: Andrew Davidson Date: Wednesday, October 28,

streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

2015-10-23 Thread Andy Davidson
I need to save the twitter status I receive so that I can do additional batch based processing on them in the future. Is it safe to assume HDFS is the best way to go? Any idea what is the best way to save twitter status to HDFS? JavaStreamingContext ssc = new JavaStreamingContext(jsc,

problems with spark 1.5.1 streaming TwitterUtils.createStream()

2015-10-21 Thread Andy Davidson
Hi I want to use twitters public streaming api to follow a set of ids. I want to implement my driver using java. The current TwitterUtils is a wrapper around twitter4j and does not expose the full twitter streaming api. I started by digging through the source code. Unfortunately I do not know

thought experiment: Can I use spark streaming to replace all of my rest services?

2015-11-10 Thread Andy Davidson
I just finished watching a great presentation from a recent spark summit on real time movie recommendations using spark: https://spark-summit.org/east-2015/talk/real-time-recommendations-using-spark . For the purposes of email I am going to really simplify what they did. In general their real time

thought experiment: use spark ML to real time prediction

2015-11-10 Thread Andy Davidson
Let's say I have used spark ML to train a linear model. I know I can save and load the model to disk. I am not sure how I can use the model in a real time environment. For example, I do not think I can return a "prediction" to the client using spark streaming easily. Also for some applications the

Re: How to configure logging...

2015-11-11 Thread Andy Davidson
Hi Hitoshi Looks like you have read http://spark.apache.org/docs/latest/configuration.html#configuring-logging On my ec2 cluster I need to also do the following. I think my notes are not complete. I think you may also need to restart your cluster Hope this helps Andy # # setting up logger so

Re: streaming: missing data. does saveAsTextFile() append or replace?

2015-11-09 Thread Andy Davidson
e (usually a good idea in HDFS), you can use a window function over the dstream and save the 'windowed' dstream instead. kind regards, Gerard. On Sat, Nov 7, 2015 at 10:55 PM, Andy Davidson <a...@santacruzintegration.com> wrote: Hi

does spark ML have some thing like createDataPartition() in R caret package ?

2015-11-13 Thread Andy Davidson
In R, it's easy to split a data set into training, crossValidation, and test sets. Is there something like this in spark.ml? I am using Python for now. My real problem is I want to randomly select a relatively small data set to do some initial data exploration. It's not clear to me how using spark I
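Spark's closest equivalent to caret's createDataPartition() is randomSplit(), e.g. df.randomSplit([0.6, 0.2, 0.2]). Its weighted-split behavior can be sketched in plain Python; this is a toy model under assumed weights and data, not Spark's implementation:

```python
import random

def random_split(rows, weights, seed=42):
    """Toy analogue of DataFrame.randomSplit(): each row lands in a
    split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds for each split, e.g. [0.6, 0.8, 1.0].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift dropping rows
    splits = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:
                splits[i].append(row)
                break
    return splits

train, cv, test = random_split(range(1000), [0.6, 0.2, 0.2])
print(len(train), len(cv), len(test))  # roughly 600 / 200 / 200
```

For the "small exploration sample" use case, DataFrame.sample(False, fraction) does the same thing with a single output.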

Re: bin/pyspark SparkContext is missing?

2015-11-16 Thread Andy Davidson
create a SparkContext instance: sc = SparkContext() On Tue, Nov 3, 2015 at 9:59 AM, Andy Davidson <a...@santacruzintegration.com> wrote: I am having a heck of a time getting Ipython notebooks to work on my 1.5.1 AWS cluster I created u

how to run RStudio or RStudio Server on ec2 cluster?

2015-11-04 Thread Andy Davidson
Hi, I just set up a spark cluster on AWS ec2. In the past I have done a lot of work using RStudio on my local machine. bin/sparkR looks interesting, however it looks like you just get an R command line interpreter. Does anyone have any experience using something like RStudio or RStudio

bin/pyspark SparkContext is missing?

2015-11-03 Thread Andy Davidson
I am having a heck of a time getting Ipython notebooks to work on my 1.5.1 AWS cluster I created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 I have read the instructions for using iPython notebook on http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell I want to run the

best practices machine learning with python 2 or 3?

2015-11-03 Thread Andy Davidson
I am fairly new to python and am starting a new project that will want to make use of Spark and the python machine learning libraries (matplotlib, pandas, …). I noticed that the spark-ec2 script set up my AWS cluster with python 2.6 and 2.7

streaming: missing data. does saveAsTextFile() append or replace?

2015-11-07 Thread Andy Davidson
Hi, I just started a new spark streaming project. In this phase of the system all we want to do is save the data we receive to hdfs. After running for a couple of days it looks like I am missing a lot of data. I wonder if saveAsTextFile("hdfs:///rawSteamingData"); is overwriting the data I
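The usual fix for this class of problem is to write each mini-batch to its own directory, which is what DStream.saveAsTextFiles(prefix) does by appending the batch time to the prefix. A rough pure-Python model of that naming scheme (the exact suffix format here is an assumption, not Spark's literal output):

```python
from datetime import datetime, timezone

def batch_output_path(prefix, batch_time_ms):
    """One output directory per mini-batch, derived from the batch time,
    so successive batches can never overwrite each other's files."""
    stamp = datetime.fromtimestamp(batch_time_ms / 1000.0, tz=timezone.utc)
    return "%s-%s" % (prefix, stamp.strftime("%Y%m%d-%H%M%S"))

# Two consecutive batches, one second apart, land in distinct directories.
p1 = batch_output_path("hdfs:///rawStreamingData", 1447000000000)
p2 = batch_output_path("hdfs:///rawStreamingData", 1447000001000)
print(p1)
print(p2)
```

By contrast, calling saveAsTextFile() with one fixed directory from every batch is what produces the overwrite/conflict behavior asked about above.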

Re: bug: can not run Ipython notebook on cluster

2015-11-07 Thread Andy Davidson
What a BEAR! The following recipe worked for me. (took a couple of days hacking). I hope this improves the out of the box experience for others Andy My test program is now In [1]: from pyspark import SparkContext textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") In [2]:

ipython notebook NameError: name 'sc' is not defined

2015-11-02 Thread Andy Davidson
Hi, I recently installed a new cluster using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2. The SparkPi sample app works correctly. I am trying to run iPython notebook on my cluster master and use an ssh tunnel so that I can work with the notebook in a browser running on my mac. Below is how I set up

bug: can not run Ipython notebook on cluster

2015-11-06 Thread Andy Davidson
Does anyone use iPython notebooks? I am able to use it on my local machine with spark, however I can not get it to work on my cluster. For an unknown reason, on my cluster I have to manually create the spark context. My test code generated this exception: Exception: Python in worker has different

java TwitterUtils.createStream() how create "user stream" ???

2015-10-19 Thread Andy Davidson
Hi, I wrote a little prototype that created a "public stream"; now I want to convert it to read tweets for a large number of explicit users. I want to create a "user stream" or a "site stream". According to the twitter developer doc I should be able to set the "follows" parameter to a list of users I

Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-17 Thread Andy Davidson
On my master: grep native /root/spark/conf/spark-env.sh SPARK_SUBMIT_LIBRARY_PATH="$SPARK_SUBMIT_LIBRARY_PATH:/root/ephemeral-hdfs/lib/native/" $ ls /root/ephemeral-hdfs/lib/native/ libhadoop.a libhadoop.so libhadooputils.a libsnappy.so libsnappy.so.1.1.3 Linux-i386-32

newbie simple app, small data set: Py4JJavaError java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-18 Thread Andy Davidson
Hi, I am working on a spark POC. I created an ec2 cluster on AWS using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2. Below is a simple python program. I am running it using IPython notebook. The notebook server is running on my spark master. If I run my program more than once using my large data set, I

Re: epoch date format to normal date format while loading the files to HDFS

2015-12-08 Thread Andy Davidson
Hi Sonia, I believe you are using java? Take a look at Java's Date class; I am sure you will find lots of examples of how to format dates. Enjoy, Andy /** * saves tweets to disk. This is a replacement for * @param tweets * @param outputURI */ private static void
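The suggestion above is about Java's date APIs; the same epoch-to-readable-date conversion is shown here in Python for brevity (the sample timestamps are invented):

```python
from datetime import datetime, timezone

def epoch_to_iso(epoch_seconds):
    """Convert an epoch timestamp (seconds) to a readable UTC date string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S")

print(epoch_to_iso(0))           # -> 1970-01-01 00:00:00
print(epoch_to_iso(1449532800))  # -> 2015-12-08 00:00:00
```

If the source data is in milliseconds (as Spark streaming batch times are), divide by 1000 first.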

Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-12-02 Thread Andy Davidson
Hi Ted and Felix From: Ted Yu Date: Sunday, November 29, 2015 at 10:37 AM To: Andrew Davidson Cc: Felix Cheung, "user @spark" Subject: Re: possible bug spark/python/pyspark/rdd.py

Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-29 Thread Andy Davidson
From: Ted Yu <yuzhih...@gmail.com> Sent: Friday, November 27, 2015 11:50 AM Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash() To: Felix Cheung <felixcheun...@hotmail.com> Cc: Andy Davidson <a...@santacruzintegration.com>

Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Andy Davidson
Hi Kyohey, I think you need to pass the argument --master $MASTER_URL, where MASTER_URL is something like spark://ec2-54-215-112-121.us-west-1.compute.amazonaws.com:7077. It's the public URL to your master. Andy From: Kyohey Hamaguchi Date: Friday, December 4, 2015

newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-03 Thread Andy Davidson
About 2 months ago I used spark-ec2 to set up a small cluster. The cluster runs a spark streaming app 7x24 and stores the data to hdfs. I also need to run some batch analytics on the data. Now that I have a little more experience, I wonder if this was a good way to set up the cluster. The following

Re: How and where to update release notes for spark rel 1.6?

2015-12-03 Thread Andy Davidson
Hi JB, Do you know where I can find instructions for upgrading an existing installation? I searched the link you provided for "update" and "upgrade". Kind regards Andy From: Jean-Baptiste Onofré Date: Thursday, December 3, 2015 at 5:29 AM To: "user @spark"

Example of a Trivial Custom PySpark Transformer

2015-12-07 Thread Andy Davidson
FYI, hopefully others will find this example helpful. Andy. Example of a Trivial Custom PySpark Transformer. ref: NLTKWordPunctTokenizer example, pyspark.sql.functions.udf

issue creating pyspark Transformer UDF that creates a LabeledPoint: AttributeError: 'DataFrame' object has no attribute '_get_object_id'

2015-12-07 Thread Andy Davidson
Hi, I am running into a strange error. I am trying to write a transformer that takes in two columns and creates a LabeledPoint. I can not figure out why I am getting AttributeError: 'DataFrame' object has no attribute '_get_object_id'. I am using spark-1.5.1-bin-hadoop2.6. Any idea what I am

newbie how to upgrade a spark-ec2 cluster?

2015-12-02 Thread Andy Davidson
I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster. Any idea how I can upgrade to 1.5.2 prebuilt binary? Also if I choose to build the binary, how would I upgrade my cluster? Kind regards Andy

Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Andy Davidson
My understanding is one of the biggest advantages of DFs is that schema information allows a lot of optimization. For example, assume a frame has many columns but your computation only uses 2 of them: there is no need to load all the data. Andy From: "email2...@gmail.com" Date:

Re: looking for Spark streaming unit example written in Java

2015-12-17 Thread Andy Davidson
ion.immutable.IndexedSeq nestedSuites(); public static void org$scalatest$Suite$_setter_$styleName_$eq(java.lang.String); public static org.scalatest.ConfigMap testDataFor$default$2(); public static org.scalatest.TestData testDataFor(java.lang.String,

Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Andy Davidson
y other process using port 7077? On 10 December 2015 at 08:52, Andy Davidson <a...@santacruzintegration.com> wrote: Hi, I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning? My job seems to run without any problem.

architecture thought experiment: what is the advantage of using kafka with spark streaming?

2015-12-10 Thread Andy Davidson
I noticed that many people are using Kafka with spark streaming. Can someone provide a couple of use cases? I imagine some possible purposes of using Kafka might be to: 1. provide some buffering? 2. implement some sort of load balancing for the overall system? 3. provide filtering

Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Andy Davidson
Hi, I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning? My job seems to run without any problem. Kind regards Andy + /root/spark/bin/spark-submit --class com.pws.spark.streaming.IngestDriver --master spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077

cluster mode uses port 6066 Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-11 Thread Andy Davidson
de submissions on port 6066. This is because standalone cluster mode uses a REST API to submit applications by default. If you submit to port 6066 instead, the warning should go away. -Andrew 2015-12-10 18:13 GMT-08:00 Andy Davidson <a...@santacruzintegration

looking for Spark streaming unit example written in Java

2015-12-15 Thread Andy Davidson
I am having a heck of a time writing a simple JUnit test for my spark streaming code. The best code example I have been able to find is http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/; unfortunately it is written in Spock and Scala. I am having trouble figuring out how to get it to

looking for a easier way to count the number of items in a JavaDStream

2015-12-15 Thread Andy Davidson
I am writing a JUnit test for some simple streaming code. I want to make assertions about how many things are in a given JavaDStream. I wonder if there is an easier way in Java to get the count? I think there are two points of friction: 1. is it easy to create an accumulator of type double or

problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-04 Thread Andy Davidson
I am having a heck of a time writing a simple transformer in Java. I assume that my Transformer is supposed to append a new column to the dataFrame argument. Any idea why I get the following exception in Java 8 when I try to call DataFrame withColumn()? The JavaDoc says withColumn() "Returns a new

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Andy Davidson
Hi Michael, I really appreciate your help. The following code works. Is there a way this example can be added to the distribution to make it easier for future java programmers? It took me a long time to get to this simple solution. I'll need to tweak this example a little to work with the new

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-05 Thread Andy Davidson
code here } }, DataTypes.StringType); DataFrame transformed = df.withColumn("filteredInput", expr("stem(rawInput)")); On Mon, Jan 4, 2016 at 8:08 PM, Andy Davidson <a...@santacruzintegration.com> wrote: I am having a heck of a time

Re: How to use Java8

2016-01-05 Thread Andy Davidson
Hi Sea From: Sea <261810...@qq.com> Date: Tuesday, January 5, 2016 at 6:16 PM To: "user @spark" Subject: How to use Java8 Hi, all. I want to support java8. I use JDK1.8.0_65 in production environment, but it doesn't work. Should I build spark using jdk1.8,

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Andy Davidson
Hi Michael, I am happy to add some documentation. I forked the repo but am having trouble with the markdown: the code examples are not rendering correctly. I am on a mac and using https://itunes.apple.com/us/app/marked-2/id890031187?mt=12 I use emacs or some other text editor to change the md.

Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-29 Thread Andy Davidson
would be awesome if you could test with 1.6 and see if things are any better? On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson <a...@santacruzintegration.com> wrote: I am using spark 1.5.1. I am running into some memory problems with a java unit test.

does HashingTF maintain a inverse index?

2015-12-31 Thread Andy Davidson
Hi, I am working on a proof of concept. I am trying to use spark to classify some documents. I am using tokenizer and hashingTF to convert the documents into vectors. Is there any easy way to map features back to words or do I need to maintain the reverse index myself? I realize there is a chance

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Andy Davidson
Hi Unk1102, I also had trouble when I used coalesce(); repartition() worked much better. Keep in mind if you have a large number of partitions you are probably going to have high communication costs. Also my code works a lot better on 1.6.0. DataFrame memory could not be spilled in 1.5.2. In 1.6.0
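The difference alluded to here — coalesce() merges existing partitions without a full shuffle, while repartition() redistributes every row — can be modeled on plain Python lists. This is a toy illustration of the semantics, not Spark's actual partitioning logic:

```python
def coalesce(partitions, n):
    """Toy model of RDD.coalesce(n): whole existing partitions are merged
    into n groups; rows are never split away from their partition, so the
    result can stay skewed (but no full shuffle is needed)."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

def repartition(partitions, n):
    """Toy model of RDD.repartition(n): a full shuffle; every row is
    redistributed individually, giving an even spread at shuffle cost."""
    out = [[] for _ in range(n)]
    rows = (row for part in partitions for row in part)
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

parts = [[1, 2, 3, 4], [5], [6], [7]]
print(coalesce(parts, 2))     # merged partitions: [[1, 2, 3, 4, 6], [5, 7]]
print(repartition(parts, 2))  # rebalanced rows:   [[1, 3, 5, 7], [2, 4, 6]]
```

This is also why coalesce(1) can be so slow: all data funnels through the surviving partition's task, which is the behavior the thread title asks about.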

how to extend java transformer from Scala UnaryTransformer ?

2016-01-01 Thread Andy Davidson
I am trying to write a trivial transformer I can use in my pipeline. I am using java and spark 1.5.2. It was suggested that I use the Tokenize.scala class as an example. This should be very easy, however I do not understand Scala and am having trouble debugging the following exception. Any help

trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Andy Davidson
I am using spark 1.5.1. I am running into some memory problems with a java unit test. Yes, I could fix it by setting -Xmx (it's set to 1024M), however I want to better understand what is going on so I can write better code in the future. The test runs on a Mac, master="local[2]". I have a java unit

how to debug java.lang.IllegalArgumentException: object is not an instance of declaring class

2015-12-24 Thread Andy Davidson
Hi Any idea how I can debug this problem. I suspect the problem has to do with how I am converting a JavaRDD> to a DataFrame. Is it boxing problem? I tried to use long and double instead of Long and Double when ever possible. Thanks in advance, Happy Holidays. Andy

Re: how to debug java.lang.IllegalArgumentException: object is not an instance of declaring class

2015-12-24 Thread Andy Davidson
Problem must be with how I am converting JavaRDD> to a DataFrame. Any suggestions? Most of my work has been done using pySpark. Tuples are a lot harder to work with in Java. JavaRDD> predictions = idLabeledPoingRDD.map((Tuple2 t2)

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Andy Davidson
Hi Yanbo I use spark.csv to load my data set. I work with both Java and Python. I would recommend you print the first couple of rows and also print the schema to make sure your data is loaded as you expect. You might find the following code example helpful. You may need to programmatically set

what is the difference between coalesce() and repartition()? Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Andy Davidson
erstanding data frame memory usage "java.io.IOException: Unable to acquire memory" Unfortunately in 1.5 we didn't force operators to spill when they ran out of memory, so there is not a lot you can do. It would be awesome if you could test with 1.6 and see if things are any better?

trouble implementing Transformer and calling DataFrame.withColumn()

2015-12-21 Thread Andy Davidson
I am trying to port the following python function to Java 8. I would like my java implementation to implement Transformer so I can use it in a pipeline. I am having a heck of a time trying to figure out how to create a Column variable I can pass to DataFrame.withColumn(). As far as I know

is Kafka Hard to configure? Does it have a high cost of ownership?

2015-12-21 Thread Andy Davidson
Hi I realize this is a little off topic. My project needs to install something like Kafka. The engineer working on that part of the system has been having a lot of trouble configuring a single node implementation. He has lost a lot of time and wants to switch to something else. Our team does not

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Andy Davidson
Regards Sab On 21-Nov-2015 11:59 pm, "Andy Davidson" <a...@santacruzintegration.com> wrote: I started working on a very simple ETL pipeline for a POC. It reads in a data set of tweets stored as JSON strings in HDFS and randomly

Re: Adding more slaves to a running cluster

2015-11-25 Thread Andy Davidson
Hi Dillian and Nicholas, If you figure out how to do this please post your recipe; it would be very useful. andy From: Nicholas Chammas Date: Wednesday, November 25, 2015 at 11:36 AM To: Dillian Murphey, "user @spark"

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
sands of empty files being created on HDFS? Hi Andy, You can try sc.wholeTextFiles() instead of sc.textFile(). Regards Sab On 24-Nov-2015 4:01 am, "Andy Davidson" <a...@santacruzintegration.com> wrote: Hi Xiao and Sabarish

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
ms 15/11/23 21:05:10 INFO python.PythonRunner: Times: total = 15375, boot = -392, init = 403, finish = 15364 15/11/23 21:05:10 INFO storage.ShuffleBlockFetcherIterator: Getting 300 non-empty blocks out of 300 blocks 15/11/23 21:05:10 INFO storage.ShuffleBlockFetc

possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-26 Thread Andy Davidson
I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I get an exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not just set PYTHONHASHSEED
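The exception comes from Python 3's hash randomization: each interpreter process salts string hashes differently unless PYTHONHASHSEED is pinned, so a hash-partitioned shuffle cannot line up across worker processes. A stdlib demonstration of why pinning the seed matters (the string 'spark' is just an example value):

```python
import os
import subprocess
import sys

def string_hash(seed):
    """Run hash('spark') in a fresh interpreter with a fixed PYTHONHASHSEED,
    mimicking one pyspark worker process."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('spark'))"], env=env)
    return int(out)

# Same seed in every "worker": hashes agree, so partitioning is consistent.
print(string_hash(0), string_hash(0))
# Different seed: the same key hashes to a different value.
print(string_hash(1))
```

This is exactly why the workaround in this thread is to export PYTHONHASHSEED in spark-env.sh so every worker inherits the same value.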

Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-17 Thread Andy Davidson
I forgot to mention. I am using spark-1.5.1-bin-hadoop2.6 From: Andrew Davidson Date: Tuesday, November 17, 2015 at 2:26 PM To: "user @spark" Subject: Re: WARN LoadSnappy: Snappy native library not loaded > FYI > > After 17 min. only

Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-17 Thread Andy Davidson
FYI After 17 min. only 26112/228155 have succeeded. This seems very slow. Kind regards Andy From: Andrew Davidson Date: Tuesday, November 17, 2015 at 2:22 PM To: "user @spark" Subject: WARN LoadSnappy: Snappy native library not

WARN LoadSnappy: Snappy native library not loaded

2015-11-17 Thread Andy Davidson
I started a spark POC. I created an EC2 cluster on AWS using spark-ec2. I have 3 slaves. In general I am running into trouble even with small workloads. I am using IPython notebooks running on my spark cluster. Everything is painfully slow. I am using the standalone cluster manager. I noticed that

newbie: unable to use all my cores and memory

2015-11-19 Thread Andy Davidson
I am having a heck of a time figuring out how to utilize my cluster effectively. I am using the standalone cluster manager. I have a master and 3 slaves. Each machine has 2 cores. I am trying to run a streaming app in cluster mode and pyspark at the same time. t1) On my console I see *
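A common cause of this symptom (a sketch, not a diagnosis of this particular cluster): in standalone mode an application that leaves spark.cores.max unset grabs every available core, so a long-running streaming app can starve any application submitted after it. Capping cores and memory per application lets two apps share the cluster; the host name, core count, memory size, and file name below are hypothetical:

```
# Cap the streaming app so a concurrent pyspark shell can still get cores
spark-submit \
  --master spark://ec2-master:7077 \
  --total-executor-cores 3 \
  --executor-memory 2g \
  my_streaming_app.py
```

The same limits can be set as spark.cores.max and spark.executor.memory in spark-defaults.conf if they should apply to every application by default.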

spark streaming problem saveAsTextFiles() does not write valid JSON to HDFS

2015-11-19 Thread Andy Davidson
I am working on a simple POC. I am running into a really strange problem. I wrote a Java streaming app that collects tweets using the spark twitter package and stores them to disk in JSON format. I noticed that when I run the code on my Mac, the files are written to the local file system as I

Re: thought experiment: use spark ML to real time prediction

2015-11-22 Thread Andy Davidson
> Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 0xAF08DF8D

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Andy Davidson
>Those are empty partitions. I don't see the number of partitions >specified in code. That then implies the default parallelism config is >being used and is set to a very high number, the sum of empty + non-empty >files. >Regards >Sab >On 21-Nov-2015 11:59 pm, "Andy Davi

ml.classification.NaiveBayesModel how to reshape theta

2016-01-12 Thread Andy Davidson
I am trying to debug my trained model by exploring theta. Theta is a Matrix. The Java doc for Matrix says that it is column-major format. I have trained a NaiveBayesModel. Is the number of classes == the number of rows? int numRows = nbModel.numClasses(); int numColumns =
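Since the Matrix stores its values column-major, element (i, j) of an m-row matrix sits at flat index i + j * m. A minimal pure-Python sketch of that index arithmetic (not the Spark API; the 2-class x 3-feature theta values are made up for illustration):

```python
def column_major_to_rows(values, num_rows, num_cols):
    """Reshape a flat column-major array into a list of rows.

    In column-major storage, element (i, j) lives at values[i + j * num_rows].
    """
    return [[values[i + j * num_rows] for j in range(num_cols)]
            for i in range(num_rows)]

# Hypothetical theta for 2 classes (rows) x 3 features (columns), laid out
# column by column: positions (0,0), (1,0), (0,1), (1,1), (0,2), (1,2).
flat = [1, 4, 2, 5, 3, 6]
assert column_major_to_rows(flat, 2, 3) == [[1, 2, 3], [4, 5, 6]]
```

So with numRows = numClasses and numCols = numFeatures, row i of the reshaped matrix is the per-feature log-probability vector for class i.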

Re: pre-install 3-party Python package on spark cluster

2016-01-11 Thread Andy Davidson
I use https://code.google.com/p/parallel-ssh/ to upgrade all my slaves From: "taotao.li" Date: Sunday, January 10, 2016 at 9:50 PM To: "user @spark" Subject: pre-install 3-party Python package on spark cluster > I have a spark cluster, from

has any one implemented TF_IDF using ML transformers?

2016-01-15 Thread Andy Davidson
I wonder if I am missing something? TF-IDF is very popular. Spark ML has a lot of transformers; however, TF-IDF is not supported directly. Spark provides a HashingTF and IDF transformer. The java doc http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf mentions you can
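The two transformers compose: HashingTF buckets each term by a hash modulo the feature dimension and counts occurrences, and IDF is then fit on those count vectors. A toy pure-Python sketch of the hashing trick behind HashingTF (using Python's built-in hash purely for illustration; Spark uses its own hash function, so the bucket indices will not match Spark's):

```python
from collections import Counter

def hashing_tf(terms, num_features=16):
    """Toy version of the hashing trick: bucket each term by
    hash(term) % num_features and count occurrences per bucket."""
    counts = Counter(hash(t) % num_features for t in terms)
    # Return a dense term-frequency vector of length num_features.
    return [counts.get(i, 0) for i in range(num_features)]

vec = hashing_tf("the cat sat on the mat".split())
assert sum(vec) == 6   # six tokens in total land somewhere
assert max(vec) >= 2   # "the" appears twice, so its bucket holds at least 2
```

The trade-off this illustrates: no vocabulary has to be built (any term maps to an index), at the cost of possible collisions when num_features is small.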

Re: trouble calculating TF-IDF data type mismatch: '(tf * idf)' requires numeric type, not vector;

2016-01-13 Thread Andy Davidson
you¹ll need the following function if you want to run the test code Kind regards Andy private DataFrame createData(JavaRDD rdd) { StructField id = null; id = new StructField("id", DataTypes.IntegerType, false, Metadata.empty()); StructField label = null;

trouble calculating TF-IDF data type mismatch: '(tf * idf)' requires numeric type, not vector;

2016-01-13 Thread Andy Davidson
Below is a little snippet of my Java test code. Any idea how I implement member-wise vector multiplication? Also notice the idf value for 'Chinese' is 0.0? The calculation is ln((4+1) / (6/4 + 1)) = ln(2) = 0.6931 ?? Also any idea if this code would work in a pipeline? I.e. is the pipeline
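One possible explanation for the 0.0 (an assumption about the data, not a certainty): the MLlib docs give the smoothed formula IDF(t) = ln((m + 1) / (DF(t) + 1)), where DF(t) is the number of documents containing t, an integer document count, not total occurrences divided by the document count. If 'Chinese' occurs in all 4 of the 4 documents, DF = 4 and the IDF is ln(5/5) = 0, not ln(2). A worked check of that formula:

```python
import math

def idf(num_docs, doc_freq):
    """Smoothed IDF as documented for Spark MLlib: ln((m + 1) / (DF + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

# A term present in every one of 4 documents gets ln(5/5) = 0.0 -- the
# smoothing deliberately zeroes out terms that appear everywhere.
assert idf(4, 4) == 0.0
# A term in only 1 of 4 documents gets ln(5/2) ~= 0.916.
assert abs(idf(4, 1) - math.log(2.5)) < 1e-12
```

Under this reading the ln(2) expectation came from plugging in 6/4 (total occurrences over documents) where an integer document frequency belongs.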

Re: ml.classification.NaiveBayesModel how to reshape theta

2016-01-13 Thread Andy Davidson
> Yep, the number of rows of Matrix theta is the number of classes and the number of > columns is the number of features. > > 2016-01-13 10:47 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>: >> I am trying to debug my trained model by exploring theta >> Theta is a Matrix. The j

Re: How To Save TF-IDF Model In PySpark

2016-01-15 Thread Andy Davidson
Are you using 1.6.0 or an older version? I think I remember something in 1.5.1 saying save was not implemented in python. The current doc does not say anything about save() http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

Re: trouble using eclipse to view spark source code

2016-01-18 Thread Andy Davidson
park source code > Have you followed the guide on how to import spark into eclipse > https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse ? > > On 18 January 2016 at 13:04, Andy Davidson <a...@santacruzintegration.com> > wrote:

trouble using eclipse to view spark source code

2016-01-18 Thread Andy Davidson
Hi My project is implemented using Java 8 and Python. Sometimes it's handy to look at the spark source code. For some unknown reason, if I open a spark project my java projects show tons of compiler errors. I think it may have something to do with Scala. If I close the projects my java code is fine.

Re: has any one implemented TF_IDF using ML transformers?

2016-01-18 Thread Andy Davidson
k" <user@spark.apache.org> Subject: Re: has any one implemented TF_IDF using ML transformers? > Hi Andy, > > Actually, the output of the ML IDF model is the TF-IDF vector of each instance > rather than the IDF vector. > So it's unnecessary to do member-wise multiplication to calculate
