Re: ML Algos

2013-08-16 Thread Lijie Xu
Thanks for your quick reply. I like the new features. I have another
related question about distributed ML.
What do you think of the architecture of Google's deep learning system
(http://www.cs.toronto.edu/~ranzato/publications/DistBeliefNIPS2012_withAppendix.pdf)?
It contains a parameter server, model replicas, and data shards. Does MLBase
have similar components?
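
For readers unfamiliar with the three components named above, here is a
minimal single-process sketch of how they fit together: the data is split
into shards, each model replica computes a gradient on its own shard, and a
parameter server holds the shared weights that replicas pull and push. This
is only an illustration. The names (ParameterServer, ModelReplica,
make_shards, train), the linear model, and the synthetic data are all
assumptions made for the example; it is not DistBelief's or MLBase's code,
and the replicas here take turns instead of running asynchronously on
separate machines.

```python
import random


class ParameterServer:
    """Holds the shared weights and applies gradients pushed by replicas."""

    def __init__(self, dim, learning_rate):
        self.weights = [0.0] * dim
        self.learning_rate = learning_rate

    def pull(self):
        # Replicas work on a copy so they never mutate the shared state directly.
        return list(self.weights)

    def push(self, gradient):
        for i, g in enumerate(gradient):
            self.weights[i] -= self.learning_rate * g


class ModelReplica:
    """Fits a linear model y = w . x using only its own data shard."""

    def __init__(self, shard):
        self.shard = shard

    def step(self, server):
        w = server.pull()
        grad = [0.0] * len(w)
        for x, y in self.shard:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                # Gradient of the mean squared error over this shard.
                grad[i] += 2.0 * err * xi / len(self.shard)
        server.push(grad)


def make_shards(data, num_shards):
    """Split the data round-robin into num_shards disjoint shards."""
    return [data[i::num_shards] for i in range(num_shards)]


def train(data, num_shards=4, rounds=200, learning_rate=0.05):
    server = ParameterServer(dim=len(data[0][0]), learning_rate=learning_rate)
    replicas = [ModelReplica(shard) for shard in make_shards(data, num_shards)]
    for _ in range(rounds):
        # Simplification: replicas take turns; real replicas would run concurrently.
        for replica in replicas:
            replica.step(server)
    return server.weights


# Synthetic, noise-free data from y = 3x + 0.5 (second feature is a constant bias).
rng = random.Random(0)
data = []
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0), 1.0]
    data.append((x, 3.0 * x[0] + 0.5))

weights = train(data)
```

With this synthetic data, the weights held by the server should end up
close to [3.0, 0.5].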


On Sat, Aug 17, 2013 at 10:37 AM, Gowtham N wrote:

> Is anyone working on neural networks, ADM, Collaborative Filtering etc?
>
>


Re: ML Algos

2013-08-16 Thread Gowtham N
Is anyone working on neural networks, ADM, Collaborative Filtering etc?


Re: ML Algos

2013-08-16 Thread Matei Zaharia
On Aug 15, 2013, at 7:13 PM, Lijie Xu  wrote:

> 3) MLBase may require Spark to provide some new features for implementing
> some specific algorithms. Are there any? Or have you added new
> fundamental features that are not supported in Spark-0.7?

On this particular aspect, we actually have a few small changes in 0.8 that are 
required in MLlib -- one is an improvement to the semantics of takeSample to 
allow over-sampling an RDD, and one is exposing each RDD's storage level as a 
public API so we can check whether it's cached and warn you otherwise. So it 
would be better to run this over 0.8 than 0.7. That said, you might be able to 
port many algorithms back to 0.7.
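
To illustrate what over-sampling means here, the following is a rough
sketch on a plain Python list, not Spark's implementation. The helper name
take_sample and its argument order are assumptions chosen to resemble
takeSample(withReplacement, num, seed). The assumed behaviour is that
sampling with replacement may return more elements than the collection
holds, while sampling without replacement is capped at the collection size.

```python
import random


def take_sample(items, with_replacement, num, seed=None):
    """Return up to `num` sampled elements from `items` (a local list).

    Hypothetical stand-in for illustration only:
    - with replacement, `num` may exceed len(items) (over-sampling);
    - without replacement, the result is capped at len(items).
    """
    if num < 0:
        raise ValueError("num must be non-negative")
    rng = random.Random(seed)
    if not items or num == 0:
        return []
    if with_replacement:
        # Each draw is independent, so any element can appear many times.
        return [rng.choice(items) for _ in range(num)]
    return rng.sample(items, min(num, len(items)))


oversampled = take_sample([1, 2, 3], True, 10, seed=42)
capped = take_sample([1, 2, 3], False, 10, seed=42)
```

Here oversampled has ten elements drawn from a three-element list, and
capped is a permutation of the original three elements.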

The plan is to release 0.8 this month, so it won't be too far away.

Matei

Re: ML Algos

2013-08-16 Thread Ameet Talwalkar
Thanks for your email -- I've responded inline.


On Thu, Aug 15, 2013 at 7:13 PM, Lijie Xu  wrote:

> Quite interesting. I have some questions about this amazing project:
> 1) In "Logistic Regression - Weak Scaling", MLlib and VW run slower on
> each processor as the data and machines increase, even though the problem
> size per machine is fixed. Could you explain which component causes this
> performance degradation? Is it synchronization, network traffic, data
> partitioning, or something else?
>

This is a good question, and to be honest, we still need to investigate
this further to get a better understanding of what's going on here.
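
As background for the question, weak scaling keeps the amount of data per
machine fixed while machines are added, so the ideal running time stays
flat. A common way to quantify the slowdown is the ratio of the
single-machine time to the N-machine time. The sketch below computes that
ratio; the function name and the timings are made up for illustration and
are not taken from the slides.

```python
def weak_scaling_efficiency(base_time, scaled_time):
    """Efficiency = T(1 machine) / T(N machines), with data grown N-fold.

    1.0 means perfect weak scaling; lower values mean overhead is growing.
    """
    if base_time <= 0 or scaled_time <= 0:
        raise ValueError("timings must be positive")
    return base_time / scaled_time


# Hypothetical wall-clock times in seconds, keyed by number of machines.
timings = {1: 100.0, 2: 105.0, 4: 118.0, 8: 140.0}

efficiencies = {
    machines: weak_scaling_efficiency(timings[1], seconds)
    for machines, seconds in timings.items()
}

for machines in sorted(efficiencies):
    print("%d machines: efficiency %.2f" % (machines, efficiencies[machines]))
```

Measuring this per phase (computation, communication, synchronization)
would be one way to narrow down which component is responsible.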


> 2) What's the relationship between MLBase and GraphX?
>

Right now the two projects are being developed separately.  MLbase does not
currently support graph-based functionality, though moving forward it would
be quite interesting to extend the MLI to include graph-based primitives
and leverage GraphX as a runtime.


>
> 3) MLBase may require Spark to provide some new features for implementing
> some specific algorithms. Are there any? Or have you added new
> fundamental features that are not supported in Spark-0.7?
>

As MLbase is a relatively new project, we have been developing MLlib and
MLI to be compatible with Spark-0.8.

>
> On Fri, Aug 16, 2013 at 4:01 AM, Ameet Talwalkar wrote:
>
>> The following slides summarize the ML algorithms to be included in MLlib
>> (slide 49) and MLI (slide 107) in the near future.  We plan to include
>> additional classification/regression/CF/clustering/optimization
>> primitives over time with the help of the open-source community, and
>> based on feedback from users about desired functionality.  Moreover, we
>> ultimately aim to add advanced ML functionality, as briefly described in
>> slide 140.
>>
>> -Ameet
>>
>>
>> On Thu, Aug 15, 2013 at 12:32 PM, Gowtham N wrote:
>>
>>> Hi,
>>>
>>> Can someone give details about future work on the ML algorithms (inside
>>> the mllib folder)?
>>> Currently only some basic algorithms are implemented. Is there a
>>> roadmap for which ML algorithms are needed?
>>>
>>
>>
>


Re: New release of Spark and Shark on Amazon EMR

2013-08-16 Thread Matei Zaharia
Cool, thanks for doing this!

Matei

On Aug 16, 2013, at 11:27 AM, Parviz deyhim  wrote:

> Amazon EMR now has the latest versions: Spark 0.7.3 and Shark 0.7.
> 
> Let me know if you have any questions. 
> 
> Thanks,
> Parviz



New release of Spark and Shark on Amazon EMR

2013-08-16 Thread Parviz deyhim
Amazon EMR now has the latest versions: Spark 0.7.3 and Shark 0.7.

Let me know if you have any questions.

Thanks,
Parviz


Re: Fail to run on yarn with release version?

2013-08-16 Thread Tom Graves
It looks like a config issue. Do you have HADOOP_CONF_DIR and HADOOP_PREFIX
set and pointing to the proper install/config locations for your cluster?
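
A quick way to check this is sketched below. It only verifies that the two
variables are set and point at existing directories; the function name
check_hadoop_env is made up for the example, and the script does not
validate the contents of the Hadoop configuration itself.

```python
import os


def check_hadoop_env(env=None, names=("HADOOP_CONF_DIR", "HADOOP_PREFIX")):
    """Report, per variable, whether it is set and points at a directory."""
    env = os.environ if env is None else env
    report = {}
    for name in names:
        value = env.get(name)
        if not value:
            report[name] = "not set"
        elif not os.path.isdir(value):
            report[name] = "set, but not a directory: " + value
        else:
            report[name] = "ok: " + value
    return report


if __name__ == "__main__":
    for name, status in check_hadoop_env().items():
        print(name + ": " + status)
```

Run it on the machine that submits the job, in the same shell environment
used to launch the YARN client.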

Tom



 From: "Liu, Raymond" 
To: "user@spark.incubator.apache.org"  
Sent: Friday, August 16, 2013 2:46 AM
Subject: Fail to run on yarn with release version?
 

Hi

I can run the Spark trunk code on top of YARN 2.0.5-alpha with:

SPARK_JAR=./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar ./run \
  spark.deploy.yarn.Client \
  --jar examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.0-SNAPSHOT.jar \
  --class spark.examples.SparkPi \
  --args yarn-standalone \
  --num-workers 3 \
  --worker-memory 2g \
  --worker-cores 2


However, if I use make-distribution.sh to build a release package and use
that package on the cluster, it fails to start. I did copy the examples jar
to the jars/ dir.
The other modes (standalone, Mesos, local) run fine with the release package.

The error encountered is:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2265)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2272)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:86)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2311)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2293)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:317)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:163)
        at spark.deploy.yarn.Client.prepareLocalResources(Client.scala:117)
        at spark.deploy.yarn.Client.run(Client.scala:59)
        at spark.deploy.yarn.Client$.main(Client.scala:318)
        at spark.deploy.yarn.Client.main(Client.scala)


Google results suggest the cause is the hdfs core-default.xml not being
included in the fat jar, but I checked and it is included.
Any idea on this issue? Thanks!


Best Regards,
Raymond Liu

Fail to run on yarn with release version?

2013-08-16 Thread Liu, Raymond
Hi

I can run the Spark trunk code on top of YARN 2.0.5-alpha with:

SPARK_JAR=./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar ./run \
  spark.deploy.yarn.Client \
  --jar examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.0-SNAPSHOT.jar \
  --class spark.examples.SparkPi \
  --args yarn-standalone \
  --num-workers 3 \
  --worker-memory 2g \
  --worker-cores 2


However, if I use make-distribution.sh to build a release package and use
that package on the cluster, it fails to start. I did copy the examples jar
to the jars/ dir.
The other modes (standalone, Mesos, local) run fine with the release package.

The error encountered is:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2265)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2272)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:86)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2311)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2293)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:317)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:163)
        at spark.deploy.yarn.Client.prepareLocalResources(Client.scala:117)
        at spark.deploy.yarn.Client.run(Client.scala:59)
        at spark.deploy.yarn.Client$.main(Client.scala:318)
        at spark.deploy.yarn.Client.main(Client.scala)


Google results suggest the cause is the hdfs core-default.xml not being
included in the fat jar, but I checked and it is included.
Any idea on this issue? Thanks!
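
One further thing that may be worth checking, though nothing in this thread
confirms it is the cause: Hadoop 2.x can discover FileSystem
implementations through a Java service file inside the jar, and when
several Hadoop jars are merged into one assembly that file can be
overwritten so that the hdfs entry is lost. The sketch below inspects a jar
for that entry. The function names are made up for the example, and the jar
path you pass in is whatever your build produced.

```python
import zipfile

# Service file that lists FileSystem implementations for Java's ServiceLoader.
SERVICE_ENTRY = "META-INF/services/org.apache.hadoop.fs.FileSystem"
# Implementation class that handles the hdfs:// scheme.
HDFS_IMPL = "org.apache.hadoop.hdfs.DistributedFileSystem"


def registered_filesystems(jar_path):
    """Return the FileSystem class names listed in the jar's service file."""
    with zipfile.ZipFile(jar_path) as jar:
        if SERVICE_ENTRY not in jar.namelist():
            return []
        text = jar.read(SERVICE_ENTRY).decode("utf-8")
    implementations = []
    for line in text.splitlines():
        # Service files allow '#' comments and blank lines.
        line = line.split("#", 1)[0].strip()
        if line:
            implementations.append(line)
    return implementations


def jar_registers_hdfs(jar_path):
    """True if the jar's service file lists the HDFS implementation."""
    return HDFS_IMPL in registered_filesystems(jar_path)
```

Calling jar_registers_hdfs on the assembly jar inside the release package
and on the one under core/target/ would show whether the two builds differ
in this respect.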


Best Regards,
Raymond Liu